Introduction: The Assignment

  1. Choose any three of the “wide” datasets identified in the Week 6/7 Discussion item. (You may use your own dataset; please don’t use my Sample Post dataset, since that was used in your Week 6 assignment!) For each of the three chosen datasets:
  1. Please include in your homework submission, for each of the three chosen datasets:
## Loading required package: xml2
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Dataset 1: Temperature in the Lower 48 (Contiguous U.S. States)

More and more US government agencies are reporting data on global warming and NASA and NOAA are leading the fray. I thought this dataset from NOAA was a good example of compacted data, where multiple observations sharing a few fields are mashed together.

Here is a screen shot of the table from the NOAA website to show where I started.

Retrieving the Temperatures

The data from NOAA downloaded as a CSV file (there were options for XML and JSON too). I was able to load it easily into a data frame once I marked a comment line with a # so it would be skipped. It came in as 17 rows with 12 columns, almost square, and not as compacted as it looked on the webpage.

tempdata <- read.csv("https://raw.githubusercontent.com/Godbero/CUNY-MSDA-IS607/master/noaa.csv", header = TRUE, sep = ",", comment.char = "#", stringsAsFactors = FALSE)
tbl_df(tempdata)
## Source: local data frame [17 x 12]
## 
##    Period Value Twentieth.Century.Mean Departure Low.Rank High.Rank
##     (int) (dbl)                  (dbl)     (dbl)    (int)     (int)
## 1       1 72.95                  72.10      0.85       91        31
## 2       2 73.44                  72.86      0.58       87        35
## 3       3 72.75                  71.40      1.35      110        12
## 4       4 69.77                  68.60      1.17      108        14
## 5       5 66.47                  65.09      1.38      115         7
## 6       6 62.96                  61.16      1.80      115         7
## 7       7 58.67                  57.25      1.42      109        13
## 8       8 55.48                  53.86      1.62      113         9
## 9       9 53.40                  51.52      1.88      116         5
## 10     10 51.99                  50.54      1.45      111        10
## 11     11 52.44                  50.86      1.58      113         8
## 12     12 53.58                  52.03      1.55      113         8
## 13     18 56.20                  55.07      1.13      107        14
## 14     24 52.81                  52.02      0.79       97        23
## 15     36 52.88                  52.00      0.88      100        19
## 16     48 53.48                  51.99      1.49      112         6
## 17     60 53.37                  51.98      1.39      111         6
## Variables not shown: Record.Low (int), Record.High (int), Lowest.Since
##   (int), Highest.Since (int), Percentile (chr), Ties (chr)
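
As a minimal sketch of the comment-skipping step, the snippet below reads a small inline CSV whose first line starts with #. The inline text, the header spelling and the name demo.csv.text are assumptions made for illustration; only the read.csv arguments mirror the call above.

```r
# Hypothetical stand-in for the downloaded file; the two data rows are
# copied from the printout above, the comment line and header are invented.
demo.csv.text <- "# Illustrative comment line that should be skipped
Period,Value,Twentieth Century Mean,Departure
1,72.95,72.10,0.85
2,73.44,72.86,0.58"

# comment.char = "#" makes read.csv ignore everything after a # on a line.
demo <- read.csv(text = demo.csv.text, header = TRUE, sep = ",",
                 comment.char = "#", stringsAsFactors = FALSE)

# read.csv also converts header spaces to dots (check.names = TRUE by default).
names(demo)
str(demo)
```

If the header in the real file contains spaces, this would also account for dotted column names such as Twentieth.Century.Mean.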

Tidying the Temperatures

Since the CSV file already broke out the highest and lowest ranks and dates, I decided to look at whole-year periods ending on the same date to look for a pattern. Looking at the screen shot above we see period 60 is September 2010 to August 2015. Period 48 is September 2011 to August 2015; period 36 is September 2012 to August 2015; period 24 is September 2013 to August 2015; and period 12 is September 2014 to August 2015. These are the rows I want. As for columns, I chose Value (temperature in °F), Twentieth.Century.Mean, Departure (difference from the mean) and High.Rank.

tidy.temp <- subset(tempdata, Period >= 12, select = c(Period, Value, Twentieth.Century.Mean, Departure, High.Rank))
colnames(tidy.temp) <- c("Period", "Temp", "Mean", "Diff", "Rank")
tidy.temp <- tidy.temp[-2, ]     # drop the second remaining row (Period 18)
tidy.temp
##    Period  Temp  Mean Diff Rank
## 12     12 53.58 52.03 1.55    8
## 14     24 52.81 52.02 0.79   23
## 15     36 52.88 52.00 0.88   19
## 16     48 53.48 51.99 1.49    6
## 17     60 53.37 51.98 1.39    6
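
Dropping a row by position works here, but it depends on the row order staying the same. As a hedged alternative sketch, the rows can be selected by their Period value instead. The data frame temps below is retyped from the printout above, and the names temps, keep and tidy are made up for this example.

```r
# Rows for Periods 12 through 60, copied from the tempdata printout above.
temps <- data.frame(
  Period = c(12, 18, 24, 36, 48, 60),
  Value = c(53.58, 56.20, 52.81, 52.88, 53.48, 53.37),
  Twentieth.Century.Mean = c(52.03, 55.07, 52.02, 52.00, 51.99, 51.98),
  Departure = c(1.55, 1.13, 0.79, 0.88, 1.49, 1.39),
  High.Rank = c(8, 14, 23, 19, 6, 6)
)

# Keep only whole-year periods (12, 24, 36, 48, 60) by value,
# so Period 18 is excluded without relying on its row position.
keep <- seq(12, 60, by = 12)
tidy <- subset(temps, Period %in% keep,
               select = c(Period, Value, Twentieth.Century.Mean, Departure, High.Rank))
colnames(tidy) <- c("Period", "Temp", "Mean", "Diff", "Rank")
tidy
```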

Analyzing the Temperatures

I made a nice small dataset, and that doesn’t leave me much to say. We can look at the table in period order and see that it does not get steadily hotter each year. Although the most recent year shows the largest difference from the mean, making it the warmest, the 2- and 3-year values are not each gradually cooler (which would happen if it got a little warmer each year on average).

If we sort by Temp we see that the 1-year period was the hottest, followed by the 4- and 5-year periods. It is good that the temperatures are in sync with the difference from the mean; that is almost a test that the data make sense and that we are reading them correctly.

I am not sure I understand Rank. The data for the 4 and 5 year time span rank as the 6th highest, but the 1 year data, which is the highest temperature and difference, ranks as the 8th highest. I don’t have anything else to say about this dataset.

arrange(tidy.temp, Temp)
##   Period  Temp  Mean Diff Rank
## 1     24 52.81 52.02 0.79   23
## 2     36 52.88 52.00 0.88   19
## 3     60 53.37 51.98 1.39    6
## 4     48 53.48 51.99 1.49    6
## 5     12 53.58 52.03 1.55    8
arrange(tidy.temp, Rank)
##   Period  Temp  Mean Diff Rank
## 1     48 53.48 51.99 1.49    6
## 2     60 53.37 51.98 1.39    6
## 3     12 53.58 52.03 1.55    8
## 4     36 52.88 52.00 0.88   19
## 5     24 52.81 52.02 0.79   23
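
The informal consistency check described above can be made explicit. This is only a sketch on values retyped from the tidy table; the names check, recomputed, consistent and same.order are invented for the example.

```r
# Values copied from the tidy.temp printout above.
check <- data.frame(
  Period = c(12, 24, 36, 48, 60),
  Temp = c(53.58, 52.81, 52.88, 53.48, 53.37),
  Mean = c(52.03, 52.02, 52.00, 51.99, 51.98),
  Diff = c(1.55, 0.79, 0.88, 1.49, 1.39)
)

# Diff should equal Temp minus Mean (rounded to the 2 decimals reported).
recomputed <- round(check$Temp - check$Mean, 2)
consistent <- isTRUE(all.equal(recomputed, check$Diff))

# Sorting by Temp and sorting by Diff should give the same row order.
same.order <- identical(order(check$Temp), order(check$Diff))

consistent
same.order
```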

Dataset 2: All the Presidents

I got the idea, URL and analysis questions from Youqing Xiang in our class. I started down the path of reading data in from Wikipedia, because I thought that was the new focus of this project (re-thought that after a post from Andy), and because it reminded me of the Introduction chapter of our textbook on UNESCO sites.

This turned out to be a long path, because of the recent change from http to https on Wikipedia (and most other sites). This made the example code in the textbook not work. I went to the textbook website and followed their suggestion to use Hadley Wickham’s new rvest package.

Retrieving the Presidents

We start by loading rvest and using it to read the List of Presidents page from Wikipedia. The result, prez.html, is a list of 2 that contains a lot of HTML code I did not print here. The first element is the header information and the second is the body HTML code. The html_nodes function lets me search the HTML code for table elements. The result is a node set of the 10 tables used on the page.

It does not take long to see that entries 1 and 2 are wikitables and our candidates for analysis. I print out table 2 here, since it is smaller and its useful information is easier to recognize. It has a row for the table header and one for each living former president. We use the html_table function to get the data ready for tidying and analysis. We will see more of the bigger table below; it proves to be more of a problem due to an inconsistent number of columns.

prez.html <- read_html("https://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States", encoding = "UTF-8")
prez.tables <- html_nodes(prez.html, "table")
prez.tables
## {xml_nodeset (10)}
##  [1] <table class="wikitable" style="text-align: center;">\n<tr>\n<th>â<U+0084> ...
##  [2] <table class="wikitable">\n<tr>\n<th>President</th>\n<th>Term of of ...
##  [3] <table style="background:#f9f9f9;font-size:85%;line-height:110%;max ...
##  [4] <table class="mbox-small plainlinks sistersitebox" style="border:1p ...
##  [5] <table class="navbox" style="border-spacing:0">\n<tr>\n<td style="p ...
##  [6] <table class="nowraplinks collapsible autocollapse navbox-inner" st ...
##  [7] <table class="navbox" style="border-spacing:0">\n<tr>\n<td style="p ...
##  [8] <table class="nowraplinks collapsible autocollapse navbox-inner" st ...
##  [9] <table class="navbox" style="border-spacing:0">\n<tr>\n<td style="p ...
## [10] <table class="nowraplinks collapsible =expanded navbox-inner" style ...
prez.tables[[2]]
## {xml_node}
## <table>
## [1] <tr>\n<th>President</th>\n<th>Term of office</th>\n<th>Date of birth ...
## [2] <tr>\n<td><a href="/wiki/George_H._W._Bush" title="George H. W. Bush ...
## [3] <tr>\n<td><a href="/wiki/Jimmy_Carter" title="Jimmy Carter">Jimmy Ca ...
## [4] <tr>\n<td><a href="/wiki/George_W._Bush" title="George W. Bush">Geor ...
## [5] <tr>\n<td><a href="/wiki/Bill_Clinton" title="Bill Clinton">Bill Cli ...
prez.former.live <- html_table(prez.tables[[2]], header = TRUE)
prez.former.live
##           President Term of office                         Date of birth
## 1 George H. W. Bush      1989–1993   (1924-06-12) June 12, 1924 (age 91)
## 2      Jimmy Carter      1977–1981 (1924-10-01) October 1, 1924 (age 91)
## 3    George W. Bush      2001–2009    (1946-07-06) July 6, 1946 (age 69)
## 4      Bill Clinton      1993–2001 (1946-08-19) August 19, 1946 (age 69)
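
To try the same steps without a network connection, here is a small sketch that runs read_html, html_nodes and html_table on an inline HTML string. The HTML below is invented for illustration (it only imitates the shape of the Wikipedia page), and the names demo.html, page, tables, wiki and df are assumptions.

```r
library(rvest)

# Invented miniature page: one wikitable plus one navigation table.
demo.html <- '<html><body>
<table class="wikitable">
<tr><th>President</th><th>Term of office</th></tr>
<tr><td>Jimmy Carter</td><td>1977-1981</td></tr>
</table>
<table class="navbox"><tr><td>navigation</td></tr></table>
</body></html>'

page <- read_html(demo.html)

# Every <table> on the page, whatever its class.
tables <- html_nodes(page, "table")
length(tables)

# Narrow the search with a class selector, then convert to a data frame.
wiki <- html_nodes(page, "table.wikitable")
df <- html_table(wiki[[1]], header = TRUE)
df
```

Selecting by class ("table.wikitable") is one way to avoid counting through the node set by hand.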

Tidying the Presidents

We look at the smaller table of Living Former Presidents first before diving into all of the US presidents.

Living Former Presidents

For the smaller table (prez.former.live) we have a data frame of 4 observations and 3 variables. This is a case of data not being wide enough. It would be a lot simpler if age were not embedded in the date of birth, if the date of birth were in one format, and if Term of office were broken out into start and end years.

term.start <- str_extract(prez.former.live$'Term of office', "[1-2]\\d{3}")
term.end.tmp <- str_extract(prez.former.live$'Term of office', "[:punct:][1-2]\\d{3}")
term.end <- str_extract(term.end.tmp, "[1-2]\\d{3}")
prez.dob <- str_extract(prez.former.live$'Date of birth', "\\d{4}-\\d{2}-\\d{2}")
prez.former.live$Start <- as.numeric(term.start)
prez.former.live$End <- as.numeric(term.end)
prez.former.live$`Date of birth` <- as.Date(prez.dob)
prez.former.live <- subset(prez.former.live, select = -2)
colnames(prez.former.live)[2] <- "DOB"
prez.former.live
##           President        DOB Start  End
## 1 George H. W. Bush 1924-06-12  1989 1993
## 2      Jimmy Carter 1924-10-01  1977 1981
## 3    George W. Bush 1946-07-06  2001 2009
## 4      Bill Clinton 1946-08-19  1993 2001
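
As an alternative sketch in base R (no stringr needed), both years can be pulled out of the term string in one pass by collecting every 4-digit run. The input strings are retyped from the table above, with the en dash written as \u2013; the names term, years, start, end, dob.raw and dob are made up for the example.

```r
# Two of the "Term of office" and "Date of birth" values from the table above.
term <- c("1989\u20131993", "1977\u20131981")
dob.raw <- c("(1924-06-12) June 12, 1924 (age 91)",
             "(1924-10-01) October 1, 1924 (age 91)")

# gregexpr finds every 4-digit run; regmatches returns them as a list,
# one character vector (start year, end year) per input string.
years <- regmatches(term, gregexpr("[0-9]{4}", term))
start <- as.numeric(vapply(years, `[`, character(1), 1))
end <- as.numeric(vapply(years, `[`, character(1), 2))

# regexpr returns only the first match, here the ISO date in parentheses.
dob <- as.Date(regmatches(dob.raw, regexpr("[0-9]{4}-[0-9]{2}-[0-9]{2}", dob.raw)))

data.frame(start, end, dob)
```

This does not depend on which punctuation character separates the two years.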

All the Presidents

It turns out we need more than just the html_table function to get all the presidents. We need more focused CSS selectors to pull out individual columns of data. The rvest documentation mentions a utility called SelectorGadget that helps with this. Using it, I found that the best selector for the presidents’ names was “b a”, for birth and death dates “b~ small”, and so on. I list the results for prez.dob here to show the raw HTML nodes and their conversion into text. After this step I have names, birth and death dates, states, terms and parties (just for fun).

prez.names <- html_nodes(prez.html, "b a")
prez.names <- html_text(prez.names)
prez.dob <- html_nodes(prez.html, "b~ small")
prez.dob
## {xml_nodeset (44)}
##  [1] <small>(1732–1799)</small>
##  [2] <small>(1735–1826)</small>
##  [3] <small>(1743–1826)</small>
##  [4] <small>(1751–1836)</small>
##  [5] <small>(1758–1831)</small>
##  [6] <small>(1767–1848)</small>
##  [7] <small>(1767–1845)</small>
##  [8] <small>(1782–1862)</small>
##  [9] <small>(1773–1841)</small>
## [10] <small>(1790–1862)</small>
## [11] <small>(1795–1849)</small>
## [12] <small>(1784–1850)</small>
## [13] <small>(1800–1874)</small>
## [14] <small>(1804–1869)</small>
## [15] <small>(1791–1868)</small>
## [16] <small>(1809–1865)</small>
## [17] <small>(1808–1875)</small>
## [18] <small>(1822–1885)</small>
## [19] <small>(1822–1893)</small>
## [20] <small>(1831–1881)</small>
## ...
prez.dob <- html_text(prez.dob)
prez.dob
##  [1] "(1732–1799)" "(1735–1826)" "(1743–1826)" "(1751–1836)"
##  [5] "(1758–1831)" "(1767–1848)" "(1767–1845)" "(1782–1862)"
##  [9] "(1773–1841)" "(1790–1862)" "(1795–1849)" "(1784–1850)"
## [13] "(1800–1874)" "(1804–1869)" "(1791–1868)" "(1809–1865)"
## [17] "(1808–1875)" "(1822–1885)" "(1822–1893)" "(1831–1881)"
## [21] "(1829–1886)" "(1837–1908)" "(1833–1901)" "(1837–1908)"
## [25] "(1843–1901)" "(1858–1919)" "(1857–1930)" "(1856–1924)"
## [29] "(1865–1923)" "(1872–1933)" "(1874–1964)" "(1882–1945)"
## [33] "(1884–1972)" "(1890–1969)" "(1917–1963)" "(1908–1973)"
## [37] "(1913–1994)" "(1913–2006)" "(born 1924)" "(1911–2004)"
## [41] "(born 1924)" "(born 1946)" "(born 1946)" "(born 1961)"
prez.states <- html_nodes(prez.html, "td:nth-child(4) a")
prez.states <- html_text(prez.states)
prez.terms <- html_nodes(prez.html, "td > .date")
prez.terms <- html_text(prez.terms)
prez.party <- html_nodes(prez.html, "td:nth-child(6) a")
prez.party <- html_text(prez.party)
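
To see what selectors like “b a” and “b~ small” pick out, here is a sketch on a single invented table row. The HTML is a guess at the general shape of a row (a bold linked name followed by a small date range), not a copy of the Wikipedia markup, and the names row.html, page, nm and yrs are assumptions.

```r
library(rvest)

# Invented row: a bold link for the name, then <small> for the life dates.
row.html <- '<table><tr>
<td>1</td>
<td><b><a href="#">George Washington</a></b><br><small>(1732-1799)</small></td>
</tr></table>'

page <- read_html(row.html)

# "b a": any <a> nested inside a <b>.
nm <- html_text(html_nodes(page, "b a"))

# "b ~ small": any <small> that is a later sibling of a <b>.
yrs <- html_text(html_nodes(page, "b ~ small"))

nm
yrs
```

Because such selectors match anywhere on the page, they can also pick up bold links outside the table, which may be one reason a vector comes back longer than expected.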

We have character vectors for each of our desired columns of data, and a quick check shows they have different lengths. There are 44 presidents in the Wikipedia table, so a length of 44, or a multiple of it, would be good.

  • prez.party length = 48, 4 more than we want
  • prez.terms length = 85, 2 dates per president would be 88
  • prez.states length = 45, 1 more than we need, let’s hope for an extra record at end
  • prez.dob length = 44, our only winner
  • prez.names length = 188, not an even multiple of 44
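
The length check listed above can be done in one step by putting the vectors in a list. This is a sketch on toy vectors; pieces, n.expected, sizes and surplus are hypothetical names, and the toy lengths are not the real ones.

```r
# Toy stand-ins for the scraped vectors (3 "presidents" expected).
pieces <- list(
  names = c("A", "B", "C", "D", "E"),
  dob = c("(1732-1799)", "(1735-1826)", "(1743-1826)"),
  states = c("Virginia", "Massachusetts", "Virginia", "extra")
)
n.expected <- 3

# lengths() returns the length of each list element as a named vector.
sizes <- lengths(pieces)

# Show only the vectors that need trimming, with how many values are extra.
surplus <- sizes - n.expected
surplus[surplus != 0]
```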

We need to trim most of these vectors to make a nice table. Below I start with prez.dob, since it already has the correct number of values, and separate the year born from the year died.

Born <- str_extract(prez.dob, "[1-2]\\d{3}")
died.tmp <- str_extract(prez.dob, "[1-2]\\d{3}[:punct:]$")
Died <- str_extract(died.tmp, "[1-2]\\d{3}")  # I get the birthdate again for living presidents
Died[Born == Died] <- NA  # that fixes the undead
State <- prez.states[prez.states != "Andrew Johnson"]  # Andrew Johnson was my 45th state
Party <- prez.party[!prez.party %in% c("[14]", "[n 12]", "[n 13]")]  # drop footnote markers; != would recycle the vector and miss [14]
Party <- Party[Party != "National Union"]  # a party we don't need
Party <- sub("\n", "", Party)  # get rid of newlines just in case
breakstr <- prez.terms[63:85]  # no dates for FDR, break at Truman
prez.terms[63] <- "March 4, 1933"  # FDR starts
prez.terms[64] <- "April 12, 1945"  # FDR dies after starting 4th term
prez.terms[65:87] <- breakstr
prez.terms[88] <- "January 20, 2017"  # To cover current office holder, Obama
Start <- prez.terms[c(TRUE, FALSE)]
End <- prez.terms[c(FALSE, TRUE)]
President <- prez.names[1:44]  # I'm sure there is a more elegant way
prez.all <- data.frame(President, Born, Died, Start, End, State, Party, stringsAsFactors = FALSE)
prez.all
##                 President Born Died              Start                End
## 1       George Washington 1732 1799     April 30, 1789      March 4, 1797
## 2              John Adams 1735 1826      March 4, 1797      March 4, 1801
## 3        Thomas Jefferson 1743 1826      March 4, 1801      March 4, 1809
## 4           James Madison 1751 1836      March 4, 1809      March 4, 1817
## 5            James Monroe 1758 1831      March 4, 1817      March 4, 1825
## 6       John Quincy Adams 1767 1848      March 4, 1825      March 4, 1829
## 7          Andrew Jackson 1767 1845      March 4, 1829      March 4, 1837
## 8        Martin Van Buren 1782 1862      March 4, 1837      March 4, 1841
## 9  William Henry Harrison 1773 1841      March 4, 1841      April 4, 1841
## 10             John Tyler 1790 1862      April 4, 1841      March 4, 1845
## 11          James K. Polk 1795 1849      March 4, 1845      March 4, 1849
## 12         Zachary Taylor 1784 1850      March 4, 1849       July 9, 1850
## 13       Millard Fillmore 1800 1874       July 9, 1850      March 4, 1853
## 14        Franklin Pierce 1804 1869      March 4, 1853      March 4, 1857
## 15         James Buchanan 1791 1868      March 4, 1857      March 4, 1861
## 16        Abraham Lincoln 1809 1865      March 4, 1861     April 15, 1865
## 17         Andrew Johnson 1808 1875     April 15, 1865      March 4, 1869
## 18       Ulysses S. Grant 1822 1885      March 4, 1869      March 4, 1877
## 19    Rutherford B. Hayes 1822 1893      March 4, 1877      March 4, 1881
## 20      James A. Garfield 1831 1881      March 4, 1881 September 19, 1881
## 21      Chester A. Arthur 1829 1886 September 19, 1881      March 4, 1885
## 22       Grover Cleveland 1837 1908      March 4, 1885      March 4, 1889
## 23      Benjamin Harrison 1833 1901      March 4, 1889      March 4, 1893
## 24       Grover Cleveland 1837 1908      March 4, 1893      March 4, 1897
## 25       William McKinley 1843 1901      March 4, 1897 September 14, 1901
## 26     Theodore Roosevelt 1858 1919 September 14, 1901      March 4, 1909
## 27    William Howard Taft 1857 1930      March 4, 1909      March 4, 1913
## 28         Woodrow Wilson 1856 1924      March 4, 1913      March 4, 1921
## 29      Warren G. Harding 1865 1923      March 4, 1921     August 2, 1923
## 30        Calvin Coolidge 1872 1933     August 2, 1923      March 4, 1929
## 31         Herbert Hoover 1874 1964      March 4, 1929      March 4, 1933
## 32  Franklin D. Roosevelt 1882 1945      March 4, 1933     April 12, 1945
## 33        Harry S. Truman 1884 1972     April 12, 1945   January 20, 1953
## 34   Dwight D. Eisenhower 1890 1969   January 20, 1953   January 20, 1961
## 35        John F. Kennedy 1917 1963   January 20, 1961  November 22, 1963
## 36      Lyndon B. Johnson 1908 1973  November 22, 1963   January 20, 1969
## 37          Richard Nixon 1913 1994   January 20, 1969     August 9, 1974
## 38            Gerald Ford 1913 2006     August 9, 1974   January 20, 1977
## 39           Jimmy Carter 1924 <NA>   January 20, 1977   January 20, 1981
## 40          Ronald Reagan 1911 2004   January 20, 1981   January 20, 1989
## 41      George H. W. Bush 1924 <NA>   January 20, 1989   January 20, 1993
## 42           Bill Clinton 1946 <NA>   January 20, 1993   January 20, 2001
## 43         George W. Bush 1946 <NA>   January 20, 2001   January 20, 2009
## 44           Barack Obama 1961 <NA>   January 20, 2009   January 20, 2017
##            State                 Party
## 1       Virginia           Independent
## 2  Massachusetts            Federalist
## 3       Virginia Democratic-Republican
## 4       Virginia Democratic-Republican
## 5       Virginia Democratic-Republican
## 6  Massachusetts Democratic-Republican
## 7      Tennessee            Democratic
## 8       New York            Democratic
## 9           Ohio                  Whig
## 10      Virginia                  Whig
## 11     Tennessee            Democratic
## 12     Louisiana                  Whig
## 13      New York                  Whig
## 14 New Hampshire            Democratic
## 15  Pennsylvania            Democratic
## 16      Illinois            Republican
## 17     Tennessee            Democratic
## 18      Illinois            Republican
## 19          Ohio            Republican
## 20          Ohio            Republican
## 21      New York            Republican
## 22      New York            Democratic
## 23       Indiana            Republican
## 24      New York            Democratic
## 25          Ohio            Republican
## 26      New York            Republican
## 27          Ohio            Republican
## 28    New Jersey            Democratic
## 29          Ohio            Republican
## 30 Massachusetts            Republican
## 31    California            Republican
## 32      New York            Democratic
## 33      Missouri            Democratic
## 34      New York            Republican
## 35 Massachusetts            Democratic
## 36         Texas            Democratic
## 37    California            Republican
## 38      Michigan            Republican
## 39       Georgia            Democratic
## 40    California            Republican
## 41         Texas            Republican
## 42      Arkansas            Democratic
## 43         Texas            Republican
## 44      Illinois            Democratic
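
Two of the cleanup steps above are easy to get wrong, so here is a sketch of both on toy data: dropping several unwanted values from a vector, and inserting missing values in the middle of one. All names and values in the block (party, drop, terms and so on) are invented for illustration.

```r
# 1. Dropping several unwanted values.
party <- c("Independent", "[14]", "Federalist", "[n 12]", "Whig", "Democratic")
drop <- c("[14]", "[n 12]", "[n 13]")

# != compares position by position, recycling drop along party,
# so a marker is removed only if it happens to line up with itself.
by.position <- party[party != drop]

# %in% tests membership, which is what we actually want.
by.membership <- party[!party %in% drop]

# 2. Inserting a missing start/end pair in the middle of a vector.
terms <- c("March 4, 1929", "March 4, 1933",        # first president
           "April 12, 1945", "January 20, 1953")    # third president
terms <- append(terms, c("March 4, 1933", "April 12, 1945"), after = 2)

# Odd positions are start dates, even positions are end dates.
Start <- terms[c(TRUE, FALSE)]
End <- terms[c(FALSE, TRUE)]

by.position
by.membership
data.frame(Start, End)
```

append() shifts the later values along automatically, so no temporary copy of the tail is needed.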

Analyzing the Presidents

Youqing Xiang suggested in her post the following ideas for analysis:

  • Give the list of presidents who had one term (4 years in the office)
  • and presidents who had two terms (8 years in the office)
  • Which president is the oldest
  • What is the average age of all the presidents

Living Former Presidents

With the short list of 4 presidents this is not very complicated or surprising. We use the mutate function from dplyr to calculate the time (years) in office and the current age of the living presidents (to 1 decimal place to break the ties). The table lists them from oldest to youngest rather than in the order they served: George H. W. Bush is older than Jimmy Carter by about 3 months, and George W. Bush is older than Bill Clinton by about 6 weeks. I tried using the summarise function in dplyr, but with so few records it was no different than taking the mean. This will be a little different with all the presidents.

prez.former.live <- mutate(prez.former.live, InOffice = End - Start, Age = as.numeric(round((Sys.Date() - DOB)/365, 1)))
prez.former.live
##           President        DOB Start  End InOffice  Age
## 1 George H. W. Bush 1924-06-12  1989 1993        4 91.4
## 2      Jimmy Carter 1924-10-01  1977 1981        4 91.1
## 3    George W. Bush 1946-07-06  2001 2009        8 69.3
## 4      Bill Clinton 1946-08-19  1993 2001        8 69.2
mean(prez.former.live$Age)
## [1] 80.25
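
Because the age above depends on Sys.Date(), it changes every time the code is run. A reproducible variant, sketched here in base R with transform instead of mutate, fixes the reference date. The date in as.of is an arbitrary assumption, and dividing by 365.25 rather than 365 is a choice made for this sketch, so the ages can differ slightly from those printed above.

```r
# Two rows retyped from the tidied table above.
former <- data.frame(
  President = c("George H. W. Bush", "Jimmy Carter"),
  DOB = as.Date(c("1924-06-12", "1924-10-01")),
  Start = c(1989, 1977),
  End = c(1993, 1981),
  stringsAsFactors = FALSE
)

# Assumed reference date (hypothetical), used in place of Sys.Date().
as.of <- as.Date("2015-10-15")

# Years in office, and age in years to 1 decimal place.
former <- transform(
  former,
  InOffice = End - Start,
  Age = round(as.numeric(difftime(as.of, DOB, units = "days")) / 365.25, 1)
)
former
```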

All the Presidents

With the list of all US presidents we can now review Youqing Xiang’s list of questions and produce some more interesting results.

  • Give the list of presidents who had one term (4 years in the office)
  • and presidents who had two terms (8 years in the office)

This is a good example of having to figure out what we mean by a question. Here is the list of presidents that served less than 4 years. Five were elected to a term but died in office before completing it. The other five were vice presidents who succeeded to the office and were not later elected in their own right; Gerald Ford, for example, was not even elected, but succeeded Nixon when Nixon resigned. We have 10 presidents that served less than 4 years.

prez.all$Start <- as.Date(prez.all$Start, format = "%B %d, %Y")
prez.all$End <- as.Date(prez.all$End, format = "%B %d, %Y")
prez.all <- mutate(prez.all, Days = as.numeric(End - Start), Years = round(Days / 365, 1))
arrange(subset(prez.all, Years < 4, select = c(-State, -Party)), Years)
##                 President Born Died      Start        End Days Years
## 1  William Henry Harrison 1773 1841 1841-03-04 1841-04-04   31   0.1
## 2       James A. Garfield 1831 1881 1881-03-04 1881-09-19  199   0.5
## 3          Zachary Taylor 1784 1850 1849-03-04 1850-07-09  492   1.3
## 4       Warren G. Harding 1865 1923 1921-03-04 1923-08-02  881   2.4
## 5             Gerald Ford 1913 2006 1974-08-09 1977-01-20  895   2.5
## 6        Millard Fillmore 1800 1874 1850-07-09 1853-03-04  969   2.7
## 7         John F. Kennedy 1917 1963 1961-01-20 1963-11-22 1036   2.8
## 8       Chester A. Arthur 1829 1886 1881-09-19 1885-03-04 1262   3.5
## 9              John Tyler 1790 1862 1841-04-04 1845-03-04 1430   3.9
## 10         Andrew Johnson 1808 1875 1865-04-15 1869-03-04 1419   3.9
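
The date conversion used above can be checked on a couple of values. This sketch uses two start and end dates from the table; the variable names are made up. Note that %B matches full month names in the current locale, so it assumes an English locale.

```r
# Start and end dates for Harrison and Garfield, as scraped text.
start.raw <- c("March 4, 1841", "March 4, 1881")
end.raw <- c("April 4, 1841", "September 19, 1881")

# %B = full month name, %d = day of month, %Y = 4-digit year.
start <- as.Date(start.raw, format = "%B %d, %Y")
end <- as.Date(end.raw, format = "%B %d, %Y")

# Subtracting Dates gives a difftime in days; as.numeric drops the units.
days <- as.numeric(end - start)
years <- round(days / 365, 1)

data.frame(start, end, days, years)
```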

Here is the list of presidents that served 4 years. We get 14 that served their 4 years (Grover Cleveland appears twice because his two terms were not consecutive). We have (10 + 14) 24 presidents either elected to 1 term or finishing a term, or just over half (about 55%) of the 44 in the table.

arrange(subset(prez.all, Years == 4, select = c(-State, -Party)), Years)
##              President Born Died      Start        End Days Years
## 1           John Adams 1735 1826 1797-03-04 1801-03-04 1460     4
## 2    John Quincy Adams 1767 1848 1825-03-04 1829-03-04 1461     4
## 3     Martin Van Buren 1782 1862 1837-03-04 1841-03-04 1461     4
## 4        James K. Polk 1795 1849 1845-03-04 1849-03-04 1461     4
## 5      Franklin Pierce 1804 1869 1853-03-04 1857-03-04 1461     4
## 6       James Buchanan 1791 1868 1857-03-04 1861-03-04 1461     4
## 7  Rutherford B. Hayes 1822 1893 1877-03-04 1881-03-04 1461     4
## 8     Grover Cleveland 1837 1908 1885-03-04 1889-03-04 1461     4
## 9    Benjamin Harrison 1833 1901 1889-03-04 1893-03-04 1461     4
## 10    Grover Cleveland 1837 1908 1893-03-04 1897-03-04 1461     4
## 11 William Howard Taft 1857 1930 1909-03-04 1913-03-04 1461     4
## 12      Herbert Hoover 1874 1964 1929-03-04 1933-03-04 1461     4
## 13        Jimmy Carter 1924 <NA> 1977-01-20 1981-01-20 1461     4
## 14   George H. W. Bush 1924 <NA> 1989-01-20 1993-01-20 1461     4
  • and presidents who had two terms (8 years in the office)

We get a similar situation when we look at two-term presidents (or more terms in 1 case). How do we count them? Abraham Lincoln was assassinated months after being elected to a second term, as was William McKinley. Lyndon Johnson, Calvin Coolidge, Theodore Roosevelt and Harry Truman finished terms for presidents who died in office, then were elected in their own right. Nixon had to resign after being reelected. Here is the list of presidents that served more than 4 years but less than 8. George Washington shows up because technically he served just under 8 years; they were still working out the details. We get 8 who were elected to a second term (even if they were not elected to the first) but did not serve 8 years in office.

arrange(subset(prez.all, (Years > 4 & Years < 8), select = c(-State, -Party)), Years)
##            President Born Died      Start        End Days Years
## 1    Abraham Lincoln 1809 1865 1861-03-04 1865-04-15 1503   4.1
## 2   William McKinley 1843 1901 1897-03-04 1901-09-14 1654   4.5
## 3  Lyndon B. Johnson 1908 1973 1963-11-22 1969-01-20 1886   5.2
## 4    Calvin Coolidge 1872 1933 1923-08-02 1929-03-04 2041   5.6
## 5      Richard Nixon 1913 1994 1969-01-20 1974-08-09 2027   5.6
## 6 Theodore Roosevelt 1858 1919 1901-09-14 1909-03-04 2728   7.5
## 7  George Washington 1732 1799 1789-04-30 1797-03-04 2865   7.8
## 8    Harry S. Truman 1884 1972 1945-04-12 1953-01-20 2840   7.8

How many presidents served 8 years or more? We get 12. Eleven of these served two regular terms (for Barack Obama, the current office holder, the end date is the scheduled one). The 12th was Franklin D. Roosevelt, who died in office a few months after being elected to his 4th term. After FDR the 2-term limit was put in place.

arrange(subset(prez.all, Years >= 8, select = c(-State, -Party)), Years)
##                President Born Died      Start        End Days Years
## 1       Thomas Jefferson 1743 1826 1801-03-04 1809-03-04 2922   8.0
## 2          James Madison 1751 1836 1809-03-04 1817-03-04 2922   8.0
## 3           James Monroe 1758 1831 1817-03-04 1825-03-04 2922   8.0
## 4         Andrew Jackson 1767 1845 1829-03-04 1837-03-04 2922   8.0
## 5       Ulysses S. Grant 1822 1885 1869-03-04 1877-03-04 2922   8.0
## 6         Woodrow Wilson 1856 1924 1913-03-04 1921-03-04 2922   8.0
## 7   Dwight D. Eisenhower 1890 1969 1953-01-20 1961-01-20 2922   8.0
## 8          Ronald Reagan 1911 2004 1981-01-20 1989-01-20 2922   8.0
## 9           Bill Clinton 1946 <NA> 1993-01-20 2001-01-20 2922   8.0
## 10        George W. Bush 1946 <NA> 2001-01-20 2009-01-20 2922   8.0
## 11          Barack Obama 1961 <NA> 2009-01-20 2017-01-20 2922   8.0
## 12 Franklin D. Roosevelt 1882 1945 1933-03-04 1945-04-12 4422  12.1

We had 24 presidents that in some way can be called 1-term presidents and we had 20 that were 2-term or more. Here is the full list sorted by days in office that shows the range of 31 to 4,422 days in office. The average number of days in office is 1,890 or just over 5 years.

# arrange(prez.all, Days, End)
arrange(subset(prez.all, Days > 0, select = c(-State, -Party)), Days, End)
##                 President Born Died      Start        End Days Years
## 1  William Henry Harrison 1773 1841 1841-03-04 1841-04-04   31   0.1
## 2       James A. Garfield 1831 1881 1881-03-04 1881-09-19  199   0.5
## 3          Zachary Taylor 1784 1850 1849-03-04 1850-07-09  492   1.3
## 4       Warren G. Harding 1865 1923 1921-03-04 1923-08-02  881   2.4
## 5             Gerald Ford 1913 2006 1974-08-09 1977-01-20  895   2.5
## 6        Millard Fillmore 1800 1874 1850-07-09 1853-03-04  969   2.7
## 7         John F. Kennedy 1917 1963 1961-01-20 1963-11-22 1036   2.8
## 8       Chester A. Arthur 1829 1886 1881-09-19 1885-03-04 1262   3.5
## 9          Andrew Johnson 1808 1875 1865-04-15 1869-03-04 1419   3.9
## 10             John Tyler 1790 1862 1841-04-04 1845-03-04 1430   3.9
## 11             John Adams 1735 1826 1797-03-04 1801-03-04 1460   4.0
## 12      John Quincy Adams 1767 1848 1825-03-04 1829-03-04 1461   4.0
## 13       Martin Van Buren 1782 1862 1837-03-04 1841-03-04 1461   4.0
## 14          James K. Polk 1795 1849 1845-03-04 1849-03-04 1461   4.0
## 15        Franklin Pierce 1804 1869 1853-03-04 1857-03-04 1461   4.0
## 16         James Buchanan 1791 1868 1857-03-04 1861-03-04 1461   4.0
## 17    Rutherford B. Hayes 1822 1893 1877-03-04 1881-03-04 1461   4.0
## 18       Grover Cleveland 1837 1908 1885-03-04 1889-03-04 1461   4.0
## 19      Benjamin Harrison 1833 1901 1889-03-04 1893-03-04 1461   4.0
## 20       Grover Cleveland 1837 1908 1893-03-04 1897-03-04 1461   4.0
## 21    William Howard Taft 1857 1930 1909-03-04 1913-03-04 1461   4.0
## 22         Herbert Hoover 1874 1964 1929-03-04 1933-03-04 1461   4.0
## 23           Jimmy Carter 1924 <NA> 1977-01-20 1981-01-20 1461   4.0
## 24      George H. W. Bush 1924 <NA> 1989-01-20 1993-01-20 1461   4.0
## 25        Abraham Lincoln 1809 1865 1861-03-04 1865-04-15 1503   4.1
## 26       William McKinley 1843 1901 1897-03-04 1901-09-14 1654   4.5
## 27      Lyndon B. Johnson 1908 1973 1963-11-22 1969-01-20 1886   5.2
## 28          Richard Nixon 1913 1994 1969-01-20 1974-08-09 2027   5.6
## 29        Calvin Coolidge 1872 1933 1923-08-02 1929-03-04 2041   5.6
## 30     Theodore Roosevelt 1858 1919 1901-09-14 1909-03-04 2728   7.5
## 31        Harry S. Truman 1884 1972 1945-04-12 1953-01-20 2840   7.8
## 32      George Washington 1732 1799 1789-04-30 1797-03-04 2865   7.8
## 33       Thomas Jefferson 1743 1826 1801-03-04 1809-03-04 2922   8.0
## 34          James Madison 1751 1836 1809-03-04 1817-03-04 2922   8.0
## 35           James Monroe 1758 1831 1817-03-04 1825-03-04 2922   8.0
## 36         Andrew Jackson 1767 1845 1829-03-04 1837-03-04 2922   8.0
## 37       Ulysses S. Grant 1822 1885 1869-03-04 1877-03-04 2922   8.0
## 38         Woodrow Wilson 1856 1924 1913-03-04 1921-03-04 2922   8.0
## 39   Dwight D. Eisenhower 1890 1969 1953-01-20 1961-01-20 2922   8.0
## 40          Ronald Reagan 1911 2004 1981-01-20 1989-01-20 2922   8.0
## 41           Bill Clinton 1946 <NA> 1993-01-20 2001-01-20 2922   8.0
## 42         George W. Bush 1946 <NA> 2001-01-20 2009-01-20 2922   8.0
## 43           Barack Obama 1961 <NA> 2009-01-20 2017-01-20 2922   8.0
## 44  Franklin D. Roosevelt 1882 1945 1933-03-04 1945-04-12 4422  12.1
mean(prez.all$Days)
## [1] 1890.341
mean(prez.all$Days) / 365.25
## [1] 5.175471
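
The four groups counted above can also be tallied in one pass. This sketch retypes the Days column from the sorted table rather than reusing prez.all, and the names days, years, bucket and counts are made up for the example.

```r
# Days in office, copied from the table sorted by Days above.
days <- c(31, 199, 492, 881, 895, 969, 1036, 1262, 1419, 1430,
          1460, rep(1461, 13),
          1503, 1654, 1886, 2027, 2041, 2728, 2840, 2865,
          rep(2922, 11), 4422)

# Same rounding as the Years column above.
years <- round(days / 365, 1)

# Label each presidency with the group it falls into.
bucket <- ifelse(years < 4, "under 4",
          ifelse(years == 4, "4",
          ifelse(years < 8, "over 4, under 8", "8 or more")))

counts <- table(bucket)
counts

# Share of presidencies that lasted 4 years or less.
(counts[["under 4"]] + counts[["4"]]) / length(days)

mean(days)
```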
  • Which president is the oldest
  • What is the average age of all the presidents

The oldest living president from our table above is George H. W. Bush who is about 3 months older than Jimmy Carter. If we want to see which presidents lived to be the oldest I show ages in years below (Gerald Ford and Ronald Reagan both died at 93). I also show their age at the end of their term to show how old they were in office (the oldest in office was Reagan at 78) and the average age was 71.

prez.all$Died[is.na(prez.all$Died)] <- "2015"  # to get an age for living presidents
prez.all <- mutate(prez.all, Age = as.numeric(Died) - as.numeric(Born))
# arrange(prez.all, desc(Age), End)
arrange(subset(prez.all, Days > 0, select = c(-State, -Party)), desc(Age), End)
##                 President Born Died      Start        End Days Years Age
## 1             Gerald Ford 1913 2006 1974-08-09 1977-01-20  895   2.5  93
## 2           Ronald Reagan 1911 2004 1981-01-20 1989-01-20 2922   8.0  93
## 3              John Adams 1735 1826 1797-03-04 1801-03-04 1460   4.0  91
## 4            Jimmy Carter 1924 2015 1977-01-20 1981-01-20 1461   4.0  91
## 5       George H. W. Bush 1924 2015 1989-01-20 1993-01-20 1461   4.0  91
## 6          Herbert Hoover 1874 1964 1929-03-04 1933-03-04 1461   4.0  90
## 7         Harry S. Truman 1884 1972 1945-04-12 1953-01-20 2840   7.8  88
## 8           James Madison 1751 1836 1809-03-04 1817-03-04 2922   8.0  85
## 9        Thomas Jefferson 1743 1826 1801-03-04 1809-03-04 2922   8.0  83
## 10      John Quincy Adams 1767 1848 1825-03-04 1829-03-04 1461   4.0  81
## 11          Richard Nixon 1913 1994 1969-01-20 1974-08-09 2027   5.6  81
## 12       Martin Van Buren 1782 1862 1837-03-04 1841-03-04 1461   4.0  80
## 13   Dwight D. Eisenhower 1890 1969 1953-01-20 1961-01-20 2922   8.0  79
## 14         Andrew Jackson 1767 1845 1829-03-04 1837-03-04 2922   8.0  78
## 15         James Buchanan 1791 1868 1857-03-04 1861-03-04 1461   4.0  77
## 16       Millard Fillmore 1800 1874 1850-07-09 1853-03-04  969   2.7  74
## 17           James Monroe 1758 1831 1817-03-04 1825-03-04 2922   8.0  73
## 18    William Howard Taft 1857 1930 1909-03-04 1913-03-04 1461   4.0  73
## 19             John Tyler 1790 1862 1841-04-04 1845-03-04 1430   3.9  72
## 20    Rutherford B. Hayes 1822 1893 1877-03-04 1881-03-04 1461   4.0  71
## 21       Grover Cleveland 1837 1908 1885-03-04 1889-03-04 1461   4.0  71
## 22       Grover Cleveland 1837 1908 1893-03-04 1897-03-04 1461   4.0  71
## 23           Bill Clinton 1946 2015 1993-01-20 2001-01-20 2922   8.0  69
## 24         George W. Bush 1946 2015 2001-01-20 2009-01-20 2922   8.0  69
## 25 William Henry Harrison 1773 1841 1841-03-04 1841-04-04   31   0.1  68
## 26      Benjamin Harrison 1833 1901 1889-03-04 1893-03-04 1461   4.0  68
## 27         Woodrow Wilson 1856 1924 1913-03-04 1921-03-04 2922   8.0  68
## 28      George Washington 1732 1799 1789-04-30 1797-03-04 2865   7.8  67
## 29         Andrew Johnson 1808 1875 1865-04-15 1869-03-04 1419   3.9  67
## 30         Zachary Taylor 1784 1850 1849-03-04 1850-07-09  492   1.3  66
## 31        Franklin Pierce 1804 1869 1853-03-04 1857-03-04 1461   4.0  65
## 32      Lyndon B. Johnson 1908 1973 1963-11-22 1969-01-20 1886   5.2  65
## 33       Ulysses S. Grant 1822 1885 1869-03-04 1877-03-04 2922   8.0  63
## 34  Franklin D. Roosevelt 1882 1945 1933-03-04 1945-04-12 4422  12.1  63
## 35     Theodore Roosevelt 1858 1919 1901-09-14 1909-03-04 2728   7.5  61
## 36        Calvin Coolidge 1872 1933 1923-08-02 1929-03-04 2041   5.6  61
## 37       William McKinley 1843 1901 1897-03-04 1901-09-14 1654   4.5  58
## 38      Warren G. Harding 1865 1923 1921-03-04 1923-08-02  881   2.4  58
## 39      Chester A. Arthur 1829 1886 1881-09-19 1885-03-04 1262   3.5  57
## 40        Abraham Lincoln 1809 1865 1861-03-04 1865-04-15 1503   4.1  56
## 41          James K. Polk 1795 1849 1845-03-04 1849-03-04 1461   4.0  54
## 42           Barack Obama 1961 2015 2009-01-20 2017-01-20 2922   8.0  54
## 43      James A. Garfield 1831 1881 1881-03-04 1881-09-19  199   0.5  50
## 44        John F. Kennedy 1917 1963 1961-01-20 1963-11-22 1036   2.8  46
prez.all <- mutate(prez.all, InOffice = as.numeric(year(ymd(End))) - as.numeric(Born))
# arrange(prez.all, desc(InOffice))
arrange(subset(prez.all, Days > 0, select = c(-State, -Party, -Days, -Years)), desc(InOffice))
##                 President Born Died      Start        End Age InOffice
## 1           Ronald Reagan 1911 2004 1981-01-20 1989-01-20  93       78
## 2    Dwight D. Eisenhower 1890 1969 1953-01-20 1961-01-20  79       71
## 3          Andrew Jackson 1767 1845 1829-03-04 1837-03-04  78       70
## 4          James Buchanan 1791 1868 1857-03-04 1861-03-04  77       70
## 5         Harry S. Truman 1884 1972 1945-04-12 1953-01-20  88       69
## 6       George H. W. Bush 1924 2015 1989-01-20 1993-01-20  91       69
## 7  William Henry Harrison 1773 1841 1841-03-04 1841-04-04  68       68
## 8            James Monroe 1758 1831 1817-03-04 1825-03-04  73       67
## 9              John Adams 1735 1826 1797-03-04 1801-03-04  91       66
## 10       Thomas Jefferson 1743 1826 1801-03-04 1809-03-04  83       66
## 11          James Madison 1751 1836 1809-03-04 1817-03-04  85       66
## 12         Zachary Taylor 1784 1850 1849-03-04 1850-07-09  66       66
## 13      George Washington 1732 1799 1789-04-30 1797-03-04  67       65
## 14         Woodrow Wilson 1856 1924 1913-03-04 1921-03-04  68       65
## 15            Gerald Ford 1913 2006 1974-08-09 1977-01-20  93       64
## 16  Franklin D. Roosevelt 1882 1945 1933-03-04 1945-04-12  63       63
## 17         George W. Bush 1946 2015 2001-01-20 2009-01-20  69       63
## 18      John Quincy Adams 1767 1848 1825-03-04 1829-03-04  81       62
## 19         Andrew Johnson 1808 1875 1865-04-15 1869-03-04  67       61
## 20      Lyndon B. Johnson 1908 1973 1963-11-22 1969-01-20  65       61
## 21          Richard Nixon 1913 1994 1969-01-20 1974-08-09  81       61
## 22      Benjamin Harrison 1833 1901 1889-03-04 1893-03-04  68       60
## 23       Grover Cleveland 1837 1908 1893-03-04 1897-03-04  71       60
## 24       Martin Van Buren 1782 1862 1837-03-04 1841-03-04  80       59
## 25    Rutherford B. Hayes 1822 1893 1877-03-04 1881-03-04  71       59
## 26         Herbert Hoover 1874 1964 1929-03-04 1933-03-04  90       59
## 27       William McKinley 1843 1901 1897-03-04 1901-09-14  58       58
## 28      Warren G. Harding 1865 1923 1921-03-04 1923-08-02  58       58
## 29        Calvin Coolidge 1872 1933 1923-08-02 1929-03-04  61       57
## 30           Jimmy Carter 1924 2015 1977-01-20 1981-01-20  91       57
## 31        Abraham Lincoln 1809 1865 1861-03-04 1865-04-15  56       56
## 32      Chester A. Arthur 1829 1886 1881-09-19 1885-03-04  57       56
## 33    William Howard Taft 1857 1930 1909-03-04 1913-03-04  73       56
## 34           Barack Obama 1961 2015 2009-01-20 2017-01-20  54       56
## 35             John Tyler 1790 1862 1841-04-04 1845-03-04  72       55
## 36       Ulysses S. Grant 1822 1885 1869-03-04 1877-03-04  63       55
## 37           Bill Clinton 1946 2015 1993-01-20 2001-01-20  69       55
## 38          James K. Polk 1795 1849 1845-03-04 1849-03-04  54       54
## 39       Millard Fillmore 1800 1874 1850-07-09 1853-03-04  74       53
## 40        Franklin Pierce 1804 1869 1853-03-04 1857-03-04  65       53
## 41       Grover Cleveland 1837 1908 1885-03-04 1889-03-04  71       52
## 42     Theodore Roosevelt 1858 1919 1901-09-14 1909-03-04  61       51
## 43      James A. Garfield 1831 1881 1881-03-04 1881-09-19  50       50
## 44        John F. Kennedy 1917 1963 1961-01-20 1963-11-22  46       46
round(mean(prez.all$Age),0)
## [1] 71

To wrap up, I used the count function from dplyr to summarize the data. It answers the old questions of which state (New York) and which party (Republican) had the most presidents.

count(prez.all, State, sort = TRUE)
## Source: local data frame [17 x 2]
## 
##            State     n
##            (chr) (int)
## 1       New York     8
## 2           Ohio     6
## 3       Virginia     5
## 4  Massachusetts     4
## 5     California     3
## 6       Illinois     3
## 7      Tennessee     3
## 8          Texas     3
## 9       Arkansas     1
## 10       Georgia     1
## 11       Indiana     1
## 12     Louisiana     1
## 13      Michigan     1
## 14      Missouri     1
## 15 New Hampshire     1
## 16    New Jersey     1
## 17  Pennsylvania     1
count(prez.all, Party, sort = TRUE)
## Source: local data frame [6 x 2]
## 
##                   Party     n
##                   (chr) (int)
## 1            Republican    18
## 2            Democratic    16
## 3 Democratic-Republican     4
## 4                  Whig     4
## 5            Federalist     1
## 6           Independent     1

Dataset 3: Reviews on Amazon

I got the idea to examine Amazon reviews, and the banana slicer specifically, from classmate Joy Peyton. I believe web scraping text, and reviews in particular, may be something I need to do. I started by reading data in from Amazon and getting it into a dataframe, and thought I would then focus on sentiment analysis.

Retrieving Amazon Reviews

This is very similar to the work on the presidents. I looked at the table information to see if I could read things in almost directly, as with the living presidents’ data. That did not seem possible; however, the data selectors were clearly marked and I was able to get a tag for each field or column I wanted. I chose:

  • title - for the review heading
  • author - for the ID or name of the review writer
  • date - for the date the review was posted
  • rate - for the 5 star rating system
  • review - for the free-form text review
amazon.html <- read_html("http://www.amazon.com/Hutzler-571-Banana-Slicer/product-reviews/B0047E0EII/ref=cm_cr_dp_see_all_btm?ie=UTF8&showViewpoints=1&sortBy=bySubmissionDateDescending", encoding = "UTF-8")
title <- html_nodes(amazon.html, ".a-color-base.a-text-bold")
title <- html_text(title)
author <- html_nodes(amazon.html, ".author")
author <- html_text(author)
date <- html_nodes(amazon.html, "#cm_cr-review_list .review-date")
date <- html_text(date)
date <- as.Date(mdy(gsub("on ", "", date)))
rate <- html_nodes(amazon.html, "#cm_cr-review_list .review-rating")
rate <- html_text(rate)
rate <- as.numeric(gsub(" out of 5 stars", "", rate))
review <- html_nodes(amazon.html, ".review-data+ .review-data")
review <- html_text(review)
reviews <- data.frame(author, title, rate, date, review, stringsAsFactors = FALSE)
reviews
##                   author
## 1               Sarah G.
## 2          Brandon Braud
## 3        Amazon Customer
## 4           M. Underhill
## 5               R. Steen
## 6  Janice Konstantinidis
## 7             onlinegirl
## 8                  paula
## 9              Chameleon
## 10          Fritz Finley
##                                                     title rate       date
## 1                                    Independence slicer!    5 2015-10-12
## 2                          surprised by elegant efficacy.    5 2015-10-12
## 3                                           Life changing    5 2015-10-10
## 4                                              Five Stars    5 2015-10-10
## 5                                              Five Stars    5 2015-10-09
## 6  In eternal wonderment of the Hutzler 521 Banana Slicer    5 2015-10-08
## 7                     Just don't go in the water with it!    3 2015-10-08
## 8                                              Do Not Buy    2 2015-10-05
## 9                                         Disappointed...    1 2015-10-05
## 10                                             Five Stars    5 2015-10-02
##    review
## 1  My twins LOVE making their own banana snack. We started using this when they were about 18 months and have had no problems or worries that they would cut themselves. 1 year olds + the banana slicer = independence = one happy momma!!
## 2  It work for turds also.  Don't put slice turd on sereal tho.
## 3  I used to think people who bought this are weirdos. But i wanted to see what the heck the hype was about so I got one for myself. Let  me tell you, OH MY GOD. It changed my way of life forever! who has time to get a knife when all you want is a sliced banana? Worry no more, this is everything you ever needed in life! Easy to use and clean, you'll have your slice banana in no time! I always keep this amazing device in my bag since I never know when I feel like eating sliced bananas. Plain whole bananas just won't do! They have to be sliced by this device!
## 4  They've done studies, you know. 60% of the time it works every time.
## 5  Game changer. Everything you knew about slicing bananas has been turned on its head.
## 6  I was given this Banana Slicer as a gift. I am nothing short of amazed at how it's changed my life. I can't think of one aspect of my existence that this product has not enhanced. I woke up this morning with a new joy de vivre, raison de'etre. My consciousness flooded with the knowledge that I have the Banana Slicer. I immediately began to meditate on it. I felt a new awareness and appreciation of all things. Cutting my morning banana was an artform.  Later in the day I began to visualize it as I went about my daily chores.  No longer daunted by the mediocrity of usual daily imperatives, I delved into my inner self for answers I'd been seeking, and voila I was rewarded. I had found the meaning of my life. I had found my truth, my metaphysical self. My all.  Thank you Teri. Elizabeth Barrett Browning, 1806 - 1861 How do I love thee? Let me count the ways.I love thee to the depth and breadth and heightMy soul can reach, when feeling out of sightFor the ends of being and ideal grace.I love thee to the level of every day’sMost quiet need, by sun and candle-light.I love thee freely, as men strive for right.I love thee purely, as they turn from praise.I love thee with the passion put to useIn my old griefs, and with my childhood’s faith.I love thee with a love I seemed to loseWith my lost saints. I love thee with the breath,Smiles, tears, of all my life; and, if God choose,I shall but love thee better after death.
## 7  h...You're going to need a bigger banana.
## 8  not sharp enough and it smushed the banana as it "cuts" through it. Not worth any price.
## 9  In the description for the Hutzler 571 Banana Slicer, it says "Great for cereal".But I found that it's not great for cereal at all! Not only did it cut my tongue, but it was hard to chew and didn't taste very good. What were they thinking? I'm going back to oatmeal.
## 10 Banana Fun!

Tidying the Reviews

We can see from above that we were able to read most data straight from the web page with the rvest function html_text. Before I describe the minimal tidying needed to get this web-page-wide information into a fairly narrow dataframe, I want to discuss a problem that I feel is more about Retrieving.

There are over 5,000 reviews of the Hutzler 571 Banana Slicer, but I only seem to be able to access them 10 at a time. I spent some time trying to figure this out and ran out of time. Maybe a web developer can give me some ideas. This 10-at-a-time limit would greatly reduce the usefulness of R for analyzing web-based reviews.
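One possible way around the 10-at-a-time limit is to request each results page in turn and stack the pieces. The sketch below is only an outline of that idea, not something I got working against Amazon. The helper names page_urls and scrape_pages are hypothetical, and the "pageNumber" query parameter is an assumption about how the site pages its results that would need to be checked in a browser first.

```r
# Hypothetical helper: build one URL per results page.
# ASSUMPTION: the site accepts a "pageNumber" query parameter; verify before use.
page_urls <- function(base_url, pages) {
  paste0(base_url, "&pageNumber=", pages)
}

# Apply a page parser to every URL and stack the resulting data frames.
# parse_page is any function that takes a URL and returns a data frame
# with the same columns each time (for example, the rvest steps above
# wrapped in a function).
scrape_pages <- function(urls, parse_page, pause = 1) {
  pieces <- lapply(urls, function(u) {
    Sys.sleep(pause)  # wait between requests so the site is not hammered
    parse_page(u)
  })
  do.call(rbind, pieces)
}

# Stand-in parser so the sketch runs without touching the network
fake_parser <- function(u) {
  data.frame(url = u, n = nchar(u), stringsAsFactors = FALSE)
}
demo_pages <- scrape_pages(
  page_urls("http://example.com/reviews?x=1", 1:3),
  fake_parser,
  pause = 0
)
```

With a real parser in place of fake_parser, the result would be one dataframe holding 10 rows per page requested.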

Back to tidying the data. When reading in the dates I captured “on March 3, 2011” and so on for each date. I used gsub to remove the “on” and the space after it, then mdy from the lubridate package in combination with as.Date to get the dates ready for a dataframe.

date <- as.Date(mdy(gsub("on ", "", date)))

In a similar way I used gsub again to strip " out of 5 stars" from the rating, and as.numeric to convert the remaining character string into a number we can add.

rate <- as.numeric(gsub(" out of 5 stars", "", rate))
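The two cleaning steps can be tried on a couple of sample strings without scraping anything. This is a minimal sketch: the sample values are made up for illustration, and it uses base R's as.Date with a format string as a stand-in for lubridate's mdy so that it runs without extra packages.

```r
# Sample strings in the shape described above (made up for illustration)
raw_date <- c("on March 3, 2011", "on October 12, 2015")
raw_rate <- c("5.0 out of 5 stars", "3.0 out of 5 stars")

# Anchor the pattern at the start so only the leading "on " is removed
clean_date <- sub("^on ", "", raw_date)

# Base R alternative to lubridate::mdy; %B expects month names in the
# session locale, so this assumes an English locale
parsed_date <- as.Date(clean_date, format = "%B %d, %Y")

# Strip the fixed suffix, then convert the remaining text to a number
clean_rate <- as.numeric(sub(" out of 5 stars$", "", raw_rate))
```

Anchoring the patterns with ^ and $ is a small safety measure: an unanchored gsub would also remove any "on " that happened to appear elsewhere in the string.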

I spent more time trying, to no avail, to make the review text field display better when printing the dataframe. A single column prints fine, as one character string per row, but the entire dataframe does not. And tbl_df from dplyr did not help.

tbl_df(reviews)
## Source: local data frame [10 x 5]
## 
##                   author
##                    (chr)
## 1               Sarah G.
## 2          Brandon Braud
## 3        Amazon Customer
## 4           M. Underhill
## 5               R. Steen
## 6  Janice Konstantinidis
## 7             onlinegirl
## 8                  paula
## 9              Chameleon
## 10          Fritz Finley
## Variables not shown: title (chr), rate (dbl), date (date), review (chr)
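One workaround for the display problem, which I did not try at the time, is to print a shortened copy of the review text. The sketch below assumes a hypothetical helper named preview and a small made-up dataframe standing in for reviews; it only changes what is printed, so the full text would stay available for analysis in the original column.

```r
# Made-up stand-in for the scraped dataframe
reviews_demo <- data.frame(
  author = c("A", "B"),
  review = c("Short review.",
             paste(rep("A very long review sentence.", 10), collapse = " ")),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep at most `width` characters and mark
# anything that was cut off with a trailing "..."
preview <- function(x, width = 40) {
  ifelse(nchar(x) > width, paste0(substr(x, 1, width - 3), "..."), x)
}

# Print a copy with the shortened text; the original column is untouched
reviews_short <- reviews_demo
reviews_short$review <- preview(reviews_short$review)
print(reviews_short)
```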

Analyzing the Reviews

What really got me interested in this example was Joy’s idea on sentiment analysis and the concept of a satire detector. I have done a simple sentiment analysis in Python before and I hoped to leverage that work.

My sketched-out ideas amount to this:

  • Find and load a sentiment dictionary in R
  • Write an R function that compares a review with the dictionary
  • Add up the point values (both positive and negative) for dictionary words found in the review
  • Assign a sentiment value to the review
  • See if a very high sentiment score paired with a high rating indicates satire
  • See if a very high sentiment score paired with a low rating indicates irony or sarcasm
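The scoring steps in this list can be sketched in a few lines of base R. This is an illustration of the idea only, not the function I was working on: the dictionary below is a tiny made-up word list with invented point values, not a real sentiment lexicon, and score_review is a hypothetical name.

```r
# Made-up toy dictionary: word -> point value (not a real lexicon)
sentiment_dict <- c(love = 3, amazing = 3, happy = 2, great = 2,
                    disappointed = -2, worry = -1, hard = -1, smushed = -2)

# Hypothetical scoring function: sum the points of dictionary words
# found in one review
score_review <- function(text, dict = sentiment_dict) {
  # Lowercase, then split on anything that is not a letter or apostrophe
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  # Keep only words that appear in the dictionary and add up their points
  sum(dict[words[words %in% names(dict)]])
}

# Made-up reviews with ratings, to show score and rating side by side
demo <- data.frame(
  rate = c(5, 1),
  review = c("I love this amazing slicer!",
             "Disappointed... it smushed the banana."),
  stringsAsFactors = FALSE
)
demo$sentiment <- vapply(demo$review, score_review, numeric(1),
                         USE.NAMES = FALSE)
```

With a score next to each rating, the last two steps would come down to looking for reviews where the two disagree, or where both are unusually high.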

I ran out of time before I got the dictionary working well enough to test a function. My limited data set would have made it difficult to test for satire.

I can give you the average rating from my 10 reviews …

round(mean(reviews$rate),1)
## [1] 4.1