Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!]
Perform the analysis requested in the discussion item.
Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.
The URL to the .Rmd file in your GitHub repository, and
The URL for your rpubs.com web page.
## Loading required package: xml2
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
More and more US government agencies are reporting data on global warming and NASA and NOAA are leading the fray. I thought this dataset from NOAA was a good example of compacted data, where multiple observations sharing a few fields are mashed together.
Here is a screen shot of the table from the NOAA website to show where I started.
The data from NOAA downloaded as a CSV file (there were options for XML and JSON too). I was able to load it easily into a data frame once I marked a comment line with a # so that it would be skipped. It came in as 17 rows with 12 columns, almost square, and not as compacted as it looked on the webpage.
library(dplyr) # provides tbl_df() here and arrange() further down
tempdata <- read.csv("https://raw.githubusercontent.com/Godbero/CUNY-MSDA-IS607/master/noaa.csv", header = TRUE, sep = ",", comment.char = "#", stringsAsFactors = FALSE)
tbl_df(tempdata)
## Source: local data frame [17 x 12]
##
## Period Value Twentieth.Century.Mean Departure Low.Rank High.Rank
## (int) (dbl) (dbl) (dbl) (int) (int)
## 1 1 72.95 72.10 0.85 91 31
## 2 2 73.44 72.86 0.58 87 35
## 3 3 72.75 71.40 1.35 110 12
## 4 4 69.77 68.60 1.17 108 14
## 5 5 66.47 65.09 1.38 115 7
## 6 6 62.96 61.16 1.80 115 7
## 7 7 58.67 57.25 1.42 109 13
## 8 8 55.48 53.86 1.62 113 9
## 9 9 53.40 51.52 1.88 116 5
## 10 10 51.99 50.54 1.45 111 10
## 11 11 52.44 50.86 1.58 113 8
## 12 12 53.58 52.03 1.55 113 8
## 13 18 56.20 55.07 1.13 107 14
## 14 24 52.81 52.02 0.79 97 23
## 15 36 52.88 52.00 0.88 100 19
## 16 48 53.48 51.99 1.49 112 6
## 17 60 53.37 51.98 1.39 111 6
## Variables not shown: Record.Low (int), Record.High (int), Lowest.Since
## (int), Highest.Since (int), Percentile (chr), Ties (chr)
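The effect of `comment.char` can be seen on a tiny inline example. This is only a sketch: the text below is a made-up three-column excerpt, not the real NOAA file, and `csv.text` and `demo` are hypothetical names.

```r
# A made-up miniature of the download: one comment line, a header, two data rows.
csv.text <- "# Contiguous U.S. average temperature (illustrative excerpt)
Period,Value,Departure
1,72.95,0.85
2,73.44,0.58"

# With comment.char = "#", the first line is ignored and the second becomes the header.
demo <- read.csv(text = csv.text, header = TRUE, sep = ",",
                 comment.char = "#", stringsAsFactors = FALSE)
dim(demo)
names(demo)
```

Without `comment.char = "#"`, `read.csv` would try to use the comment line as the header.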
Since the CSV file already broke out the highest and lowest ranks and dates, I decided to compare windows of different lengths that all end in the same month to look for a pattern. Looking at the screen shot above, we see period 60 is September 2010 to August 2015; period 48 is September 2011 to August 2015; period 36 is September 2012 to August 2015; period 24 is September 2013 to August 2015; and period 12 is September 2014 to August 2015. These are the rows I want. For columns I chose Value (temperature in °F), Twentieth.Century.Mean, Departure (difference from the mean) and High.Rank.
tidy.temp <- subset(tempdata, Period >= 12, select = c(Period, Value, Twentieth.Century.Mean, Departure, High.Rank))
colnames(tidy.temp) <- c("Period", "Temp", "Mean", "Diff", "Rank")
tidy.temp <- tidy.temp[-2, ] # drop the second remaining row, which is Period 18
tidy.temp
## Period Temp Mean Diff Rank
## 12 12 53.58 52.03 1.55 8
## 14 24 52.81 52.02 0.79 23
## 15 36 52.88 52.00 0.88 19
## 16 48 53.48 51.99 1.49 6
## 17 60 53.37 51.98 1.39 6
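The same five rows can also be picked by naming the periods, which avoids dropping a row by position. This is a sketch in base R, using a hand-typed copy of a few rows from the table above; `tempdata.demo`, `wanted` and `tidy.demo` are hypothetical names.

```r
# A few rows copied from the NOAA table shown earlier (Periods 11 to 60).
tempdata.demo <- data.frame(
  Period = c(11, 12, 18, 24, 36, 48, 60),
  Value = c(52.44, 53.58, 56.20, 52.81, 52.88, 53.48, 53.37),
  Twentieth.Century.Mean = c(50.86, 52.03, 55.07, 52.02, 52.00, 51.99, 51.98),
  Departure = c(1.58, 1.55, 1.13, 0.79, 0.88, 1.49, 1.39),
  High.Rank = c(8, 8, 14, 23, 19, 6, 6)
)

# Keep only whole-year windows by naming them, so Period 18 never gets in.
wanted <- c(12, 24, 36, 48, 60)
tidy.demo <- subset(tempdata.demo, Period %in% wanted,
                    select = c(Period, Value, Twentieth.Century.Mean, Departure, High.Rank))
colnames(tidy.demo) <- c("Period", "Temp", "Mean", "Diff", "Rank")
tidy.demo
```

Selecting by value keeps working if the row order of the source file changes, which a positional `[-2, ]` does not.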
I made a nice small dataset, and that doesn’t leave me much to say. Looking at the table in period order, we see that the averages do not rise steadily as the window gets shorter. The most recent year shows the largest difference from the mean, making it the warmest, but the longer windows are not each gradually cooler (which would happen if it got a little warmer each year on average): the 2- and 3-year averages are lower than the 4- and 5-year averages.
If we sort by Temp we see that the 1-year period was the hottest, followed by the 4- and 5-year periods. It is reassuring that the temperatures are in sync with the differences from the mean. It is almost a test that the data make sense and that we are reading them correctly.
I am not sure I understand Rank. The data for the 4 and 5 year time span rank as the 6th highest, but the 1 year data, which is the highest temperature and difference, ranks as the 8th highest. I don’t have anything else to say about this dataset.
arrange(tidy.temp, Temp)
## Period Temp Mean Diff Rank
## 1 24 52.81 52.02 0.79 23
## 2 36 52.88 52.00 0.88 19
## 3 60 53.37 51.98 1.39 6
## 4 48 53.48 51.99 1.49 6
## 5 12 53.58 52.03 1.55 8
arrange(tidy.temp, Rank)
## Period Temp Mean Diff Rank
## 1 48 53.48 51.99 1.49 6
## 2 60 53.37 51.98 1.39 6
## 3 12 53.58 52.03 1.55 8
## 4 36 52.88 52.00 0.88 19
## 5 24 52.81 52.02 0.79 23
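The consistency check described above (temperatures in sync with the difference from the mean) can be made explicit. A small sketch, using the five rows of the tidy table typed in by hand; `tidy.check` is a hypothetical name.

```r
# The five rows of the tidy table, copied from the output above.
tidy.check <- data.frame(
  Period = c(12, 24, 36, 48, 60),
  Temp = c(53.58, 52.81, 52.88, 53.48, 53.37),
  Mean = c(52.03, 52.02, 52.00, 51.99, 51.98),
  Diff = c(1.55, 0.79, 0.88, 1.49, 1.39),
  Rank = c(8, 23, 19, 6, 6)
)

# Diff should equal Temp - Mean, up to floating-point noise.
diff.matches <- isTRUE(all.equal(tidy.check$Temp - tidy.check$Mean, tidy.check$Diff))

# Ordering by Temp and ordering by Diff should put the periods in the same sequence.
same.order <- identical(order(-tidy.check$Temp), order(-tidy.check$Diff))

c(diff.matches = diff.matches, same.order = same.order)
```

Rank is left out of the check on purpose: each window length appears to be ranked against other windows of the same length, so ranks are not comparable across rows.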
I got the idea, URL and analysis questions from Youqing Xiang in our class. I started down the path of reading data in from Wikipedia, because I thought that was the new focus of this project (re-thought that after a post from Andy), and because it reminded me of the Introduction chapter of our textbook on UNESCO sites.
This turned out to be a long path, because of the recent change from http to https on Wikipedia (and most other sites). This made the example code in the textbook not work. I went to the textbook website and followed their suggestion to use Hadley Wickham’s new rvest package.
We start by loading rvest and using it to read the List of Presidents page from Wikipedia. The result, prez.html, is a list of 2 that contains a lot of HTML code I did not print here. The first element is the header information and the second is the body HTML. The html_nodes function lets me search the HTML for table elements. The result is a node set of the 10 tables on the page.
It does not take long to see that entries 1 and 2 are wikitables and our candidates for analysis. I print out table 2 here, since it is smaller and easier to recognize useful information. We have a row for the table header and one for each living former president. We use the html_table function to get the data ready for tidying and analysis. We will see more of the bigger table below. It proves to be more of a problem due to an inconsistent number of columns.
library(rvest) # provides read_html(), html_nodes(), html_text() and html_table()
prez.html <- read_html("https://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States", encoding = "UTF-8")
prez.tables <- html_nodes(prez.html, "table")
prez.tables
## {xml_nodeset (10)}
## [1] <table class="wikitable" style="text-align: center;">\n<tr>\n<th>â<U+0084> ...
## [2] <table class="wikitable">\n<tr>\n<th>President</th>\n<th>Term of of ...
## [3] <table style="background:#f9f9f9;font-size:85%;line-height:110%;max ...
## [4] <table class="mbox-small plainlinks sistersitebox" style="border:1p ...
## [5] <table class="navbox" style="border-spacing:0">\n<tr>\n<td style="p ...
## [6] <table class="nowraplinks collapsible autocollapse navbox-inner" st ...
## [7] <table class="navbox" style="border-spacing:0">\n<tr>\n<td style="p ...
## [8] <table class="nowraplinks collapsible autocollapse navbox-inner" st ...
## [9] <table class="navbox" style="border-spacing:0">\n<tr>\n<td style="p ...
## [10] <table class="nowraplinks collapsible =expanded navbox-inner" style ...
prez.tables[[2]]
## {xml_node}
## <table>
## [1] <tr>\n<th>President</th>\n<th>Term of office</th>\n<th>Date of birth ...
## [2] <tr>\n<td><a href="/wiki/George_H._W._Bush" title="George H. W. Bush ...
## [3] <tr>\n<td><a href="/wiki/Jimmy_Carter" title="Jimmy Carter">Jimmy Ca ...
## [4] <tr>\n<td><a href="/wiki/George_W._Bush" title="George W. Bush">Geor ...
## [5] <tr>\n<td><a href="/wiki/Bill_Clinton" title="Bill Clinton">Bill Cli ...
prez.former.live <- html_table(prez.tables[[2]], header = TRUE)
prez.former.live
## President Term of office Date of birth
## 1 George H. W. Bush 1989–1993 (1924-06-12) June 12, 1924 (age 91)
## 2 Jimmy Carter 1977–1981 (1924-10-01) October 1, 1924 (age 91)
## 3 George W. Bush 2001–2009 (1946-07-06) July 6, 1946 (age 69)
## 4 Bill Clinton 1993–2001 (1946-08-19) August 19, 1946 (age 69)
We look at the smaller table of Living Former Presidents first before diving into all of the US presidents.
For the smaller table (prez.former.live) we have a data frame of 4 observations and 3 variables. This is a case of data not being wide enough. It would be a lot simpler if age were not embedded with date of birth, if date of birth were in one format, and if Term of office were broken out into start and end years.
library(stringr) # provides str_extract()
term.start <- str_extract(prez.former.live$`Term of office`, "[1-2]\\d{3}") # first 4-digit year
term.end.tmp <- str_extract(prez.former.live$`Term of office`, "[:punct:][1-2]\\d{3}") # dash plus second year
term.end <- str_extract(term.end.tmp, "[1-2]\\d{3}") # strip the dash
prez.dob <- str_extract(prez.former.live$`Date of birth`, "\\d{4}-\\d{2}-\\d{2}") # ISO date in parentheses
prez.former.live$Start <- as.numeric(term.start)
prez.former.live$End <- as.numeric(term.end)
prez.former.live$`Date of birth` <- as.Date(prez.dob)
prez.former.live <- subset(prez.former.live, select = -2)
colnames(prez.former.live)[2] <- "DOB"
prez.former.live
## President DOB Start End
## 1 George H. W. Bush 1924-06-12 1989 1993
## 2 Jimmy Carter 1924-10-01 1977 1981
## 3 George W. Bush 1946-07-06 2001 2009
## 4 Bill Clinton 1946-08-19 1993 2001
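For comparison, the date-of-birth cleanup can be sketched in base R without stringr. The two strings below are copied from the table printed earlier; `dob.raw`, `iso` and `DOB` are names chosen for the illustration.

```r
# Two raw "Date of birth" cells as they come out of html_table().
dob.raw <- c("(1924-06-12) June 12, 1924 (age 91)",
             "(1946-08-19) August 19, 1946 (age 69)")

# regexpr() finds the first ISO-style date in each string; regmatches() pulls it out.
iso <- regmatches(dob.raw, regexpr("[0-9]{4}-[0-9]{2}-[0-9]{2}", dob.raw))

# ISO dates convert without a format string.
DOB <- as.Date(iso)
DOB
```

One caveat of this approach: `regmatches` silently drops strings with no match, so the result can be shorter than the input, whereas `str_extract` returns `NA` in that position.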
It turns out we need more than just the html_table function to get all the presidents. We need a more focused CSS selector to pull out each column of data. The rvest documentation mentions a utility called SelectorGadget that helps with this. Using it, I found that the best selector for the presidents’ names was “b a”, for birth and death dates “b~ small”, and so on. I list the contents of prez.dob here to show the raw HTML nodes and their conversion into text. After this step I have names, birth and death dates, states, terms and parties (just for fun).
prez.names <- html_nodes(prez.html, "b a")
prez.names <- html_text(prez.names)
prez.dob <- html_nodes(prez.html, "b~ small")
prez.dob
## {xml_nodeset (44)}
## [1] <small>(1732–1799)</small>
## [2] <small>(1735–1826)</small>
## [3] <small>(1743–1826)</small>
## [4] <small>(1751–1836)</small>
## [5] <small>(1758–1831)</small>
## [6] <small>(1767–1848)</small>
## [7] <small>(1767–1845)</small>
## [8] <small>(1782–1862)</small>
## [9] <small>(1773–1841)</small>
## [10] <small>(1790–1862)</small>
## [11] <small>(1795–1849)</small>
## [12] <small>(1784–1850)</small>
## [13] <small>(1800–1874)</small>
## [14] <small>(1804–1869)</small>
## [15] <small>(1791–1868)</small>
## [16] <small>(1809–1865)</small>
## [17] <small>(1808–1875)</small>
## [18] <small>(1822–1885)</small>
## [19] <small>(1822–1893)</small>
## [20] <small>(1831–1881)</small>
## ...
prez.dob <- html_text(prez.dob)
prez.dob
## [1] "(1732–1799)" "(1735–1826)" "(1743–1826)" "(1751–1836)"
## [5] "(1758–1831)" "(1767–1848)" "(1767–1845)" "(1782–1862)"
## [9] "(1773–1841)" "(1790–1862)" "(1795–1849)" "(1784–1850)"
## [13] "(1800–1874)" "(1804–1869)" "(1791–1868)" "(1809–1865)"
## [17] "(1808–1875)" "(1822–1885)" "(1822–1893)" "(1831–1881)"
## [21] "(1829–1886)" "(1837–1908)" "(1833–1901)" "(1837–1908)"
## [25] "(1843–1901)" "(1858–1919)" "(1857–1930)" "(1856–1924)"
## [29] "(1865–1923)" "(1872–1933)" "(1874–1964)" "(1882–1945)"
## [33] "(1884–1972)" "(1890–1969)" "(1917–1963)" "(1908–1973)"
## [37] "(1913–1994)" "(1913–2006)" "(born 1924)" "(1911–2004)"
## [41] "(born 1924)" "(born 1946)" "(born 1946)" "(born 1961)"
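One way to split these lifespan strings into birth and death years is to collect every 4-digit year and take the first and second. This is a base R sketch, not the approach used in this write-up; the four sample strings come from the output above and `lifespans` and `years` are hypothetical names.

```r
# Sample lifespan strings; \u{2013} is the en dash Wikipedia uses between years.
lifespans <- c("(1732\u{2013}1799)", "(1913\u{2013}2006)", "(born 1924)", "(born 1961)")

# gregexpr() finds every 4-digit year; regmatches() returns one character vector per string.
years <- regmatches(lifespans, gregexpr("[12][0-9]{3}", lifespans))

# The first year is the birth year; the second is missing (NA) for living presidents.
Born.demo <- as.numeric(sapply(years, `[`, 1))
Died.demo <- as.numeric(sapply(years, `[`, 2))
data.frame(lifespans, Born.demo, Died.demo)
```

Because a missing second year comes back as `NA` directly, no later step is needed to undo a birth year that was picked up as a death year.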
prez.states <- html_nodes(prez.html, "td:nth-child(4) a")
prez.states <- html_text(prez.states)
prez.terms <- html_nodes(prez.html, "td > .date")
prez.terms <- html_text(prez.terms)
prez.party <- html_nodes(prez.html, "td:nth-child(6) a")
prez.party <- html_text(prez.party)
We have strings for each of our desired columns of data and a quick check shows they are different lengths. There are 44 presidents in the Wikipedia table, so a multiple of 44 would be good.
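The quick length check can be sketched as follows. The vectors here are empty stand-ins, not the scraped data: the lengths 44, 45 and 85 are inferred from the cleanup code in this write-up, while the 50 for names is purely a placeholder.

```r
# Stand-ins with assumed lengths; in the real session these are the scraped vectors.
scraped <- list(
  names = character(50),   # placeholder length, the real value is not shown
  dob = character(44),
  states = character(45),  # one stray link too many
  terms = character(85)    # start/end pairs with a few dates missing
)

# lengths() reports the length of each element; 44 or a multiple of it is what we want.
n.expected <- 44
check <- data.frame(n = lengths(scraped),
                    multiple.of.44 = lengths(scraped) %% n.expected == 0)
check
```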
We need to trim most of these strings to make a nice table. Below I start with prez.dob, since it already has the correct number of values (44), and separate the year born from the year died.
Born <- str_extract(prez.dob, "[1-2]\\d{3}")
died.tmp <- str_extract(prez.dob, "[1-2]\\d{3}[:punct:]$")
Died <- str_extract(died.tmp, "[1-2]\\d{3}") # I get the birthdate again for living presidents
Died[Born == Died] <- NA # that fixes the undead
State <- prez.states[prez.states != "Andrew Johnson"] # Andrew Johnson was my 45th state
# `!=` against a vector compares element by element (with recycling), so it misses footnotes;
# %in% tests membership and drops every footnote marker in one step
Party <- prez.party[!prez.party %in% c("[14]", "[n 12]", "[n 13]")]
Party <- Party[Party != "National Union"] # a party we don't need
Party <- sub("\n", "", Party) # get rid of newlines just in case
breakstr <- prez.terms[63:85] # no dates for FDR, break at Truman
prez.terms[63] <- "March 4, 1933" # FDR starts
prez.terms[64] <- "April 12, 1945" # FDR dies after starting 4th term
prez.terms[65:87] <- breakstr
prez.terms[88] <- "January 20, 2017" # To cover current office holder, Obama
Start <- prez.terms[c(TRUE, FALSE)]
End <- prez.terms[c(FALSE, TRUE)]
President <- prez.names[1:44] # I'm sure there is a more elegant way
prez.all <- data.frame(President, Born, Died, Start, End, State, Party, stringsAsFactors = FALSE)
prez.all
## President Born Died Start End
## 1 George Washington 1732 1799 April 30, 1789 March 4, 1797
## 2 John Adams 1735 1826 March 4, 1797 March 4, 1801
## 3 Thomas Jefferson 1743 1826 March 4, 1801 March 4, 1809
## 4 James Madison 1751 1836 March 4, 1809 March 4, 1817
## 5 James Monroe 1758 1831 March 4, 1817 March 4, 1825
## 6 John Quincy Adams 1767 1848 March 4, 1825 March 4, 1829
## 7 Andrew Jackson 1767 1845 March 4, 1829 March 4, 1837
## 8 Martin Van Buren 1782 1862 March 4, 1837 March 4, 1841
## 9 William Henry Harrison 1773 1841 March 4, 1841 April 4, 1841
## 10 John Tyler 1790 1862 April 4, 1841 March 4, 1845
## 11 James K. Polk 1795 1849 March 4, 1845 March 4, 1849
## 12 Zachary Taylor 1784 1850 March 4, 1849 July 9, 1850
## 13 Millard Fillmore 1800 1874 July 9, 1850 March 4, 1853
## 14 Franklin Pierce 1804 1869 March 4, 1853 March 4, 1857
## 15 James Buchanan 1791 1868 March 4, 1857 March 4, 1861
## 16 Abraham Lincoln 1809 1865 March 4, 1861 April 15, 1865
## 17 Andrew Johnson 1808 1875 April 15, 1865 March 4, 1869
## 18 Ulysses S. Grant 1822 1885 March 4, 1869 March 4, 1877
## 19 Rutherford B. Hayes 1822 1893 March 4, 1877 March 4, 1881
## 20 James A. Garfield 1831 1881 March 4, 1881 September 19, 1881
## 21 Chester A. Arthur 1829 1886 September 19, 1881 March 4, 1885
## 22 Grover Cleveland 1837 1908 March 4, 1885 March 4, 1889
## 23 Benjamin Harrison 1833 1901 March 4, 1889 March 4, 1893
## 24 Grover Cleveland 1837 1908 March 4, 1893 March 4, 1897
## 25 William McKinley 1843 1901 March 4, 1897 September 14, 1901
## 26 Theodore Roosevelt 1858 1919 September 14, 1901 March 4, 1909
## 27 William Howard Taft 1857 1930 March 4, 1909 March 4, 1913
## 28 Woodrow Wilson 1856 1924 March 4, 1913 March 4, 1921
## 29 Warren G. Harding 1865 1923 March 4, 1921 August 2, 1923
## 30 Calvin Coolidge 1872 1933 August 2, 1923 March 4, 1929
## 31 Herbert Hoover 1874 1964 March 4, 1929 March 4, 1933
## 32 Franklin D. Roosevelt 1882 1945 March 4, 1933 April 12, 1945
## 33 Harry S. Truman 1884 1972 April 12, 1945 January 20, 1953
## 34 Dwight D. Eisenhower 1890 1969 January 20, 1953 January 20, 1961
## 35 John F. Kennedy 1917 1963 January 20, 1961 November 22, 1963
## 36 Lyndon B. Johnson 1908 1973 November 22, 1963 January 20, 1969
## 37 Richard Nixon 1913 1994 January 20, 1969 August 9, 1974
## 38 Gerald Ford 1913 2006 August 9, 1974 January 20, 1977
## 39 Jimmy Carter 1924 <NA> January 20, 1977 January 20, 1981
## 40 Ronald Reagan 1911 2004 January 20, 1981 January 20, 1989
## 41 George H. W. Bush 1924 <NA> January 20, 1989 January 20, 1993
## 42 Bill Clinton 1946 <NA> January 20, 1993 January 20, 2001
## 43 George W. Bush 1946 <NA> January 20, 2001 January 20, 2009
## 44 Barack Obama 1961 <NA> January 20, 2009 January 20, 2017
## State Party
## 1 Virginia Independent
## 2 Massachusetts Federalist
## 3 Virginia Democratic-Republican
## 4 Virginia Democratic-Republican
## 5 Virginia Democratic-Republican
## 6 Massachusetts Democratic-Republican
## 7 Tennessee Democratic
## 8 New York Democratic
## 9 Ohio Whig
## 10 Virginia Whig
## 11 Tennessee Democratic
## 12 Louisiana Whig
## 13 New York Whig
## 14 New Hampshire Democratic
## 15 Pennsylvania Democratic
## 16 Illinois Republican
## 17 Tennessee Democratic
## 18 Illinois Republican
## 19 Ohio Republican
## 20 Ohio Republican
## 21 New York Republican
## 22 New York Democratic
## 23 Indiana Republican
## 24 New York Democratic
## 25 Ohio Republican
## 26 New York Republican
## 27 Ohio Republican
## 28 New Jersey Democratic
## 29 Ohio Republican
## 30 Massachusetts Republican
## 31 California Republican
## 32 New York Democratic
## 33 Missouri Democratic
## 34 New York Republican
## 35 Massachusetts Democratic
## 36 Texas Democratic
## 37 California Republican
## 38 Michigan Republican
## 39 Georgia Democratic
## 40 California Republican
## 41 Texas Republican
## 42 Arkansas Democratic
## 43 Texas Republican
## 44 Illinois Democratic
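When several unwanted values have to be filtered out at once, as in the Party cleanup, `!=` and `%in%` behave very differently. The sketch below uses a made-up vector (`party.demo`) to show the difference; it is an illustration of R's recycling rule, not a rerun of the scrape.

```r
# A made-up mix of party names and footnote markers.
party.demo <- c("Independent", "[14]", "Federalist", "[n 12]", "Whig", "[14]", "[n 13]")
footnotes <- c("[14]", "[n 12]", "[n 13]")

# `!=` recycles `footnotes` along party.demo and compares position by position,
# so a marker is only dropped if it happens to line up with itself.
# (R also warns that the lengths do not divide evenly; suppressed here.)
by.recycling <- suppressWarnings(party.demo[party.demo != footnotes])

# %in% asks "is this element any of the footnotes?", which is the intended test.
by.membership <- party.demo[!party.demo %in% footnotes]

length(by.recycling)
by.membership
```

In this example the position-by-position comparison removes nothing at all, while the membership test leaves only the three party names.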
Youqing Xiang suggested in her post the following ideas for analysis:
With the short list of 4 presidents this is not very complicated or surprising. We use the mutate function from dplyr to calculate the time in office (years) and the current age of the living presidents (to 1 decimal place to break the ties). The table lists them in order of age, not order of service: George H. W. Bush is older than Jimmy Carter by about 3 months, even though Carter served first. I tried using the summarize function in dplyr, but with so few records it was no different than taking the mean. This will be a little different with all the presidents.
prez.former.live <- mutate(prez.former.live, InOffice = End - Start, Age = as.numeric(round((Sys.Date() - DOB)/365, 1)))
prez.former.live
## President DOB Start End InOffice Age
## 1 George H. W. Bush 1924-06-12 1989 1993 4 91.4
## 2 Jimmy Carter 1924-10-01 1977 1981 4 91.1
## 3 George W. Bush 1946-07-06 2001 2009 8 69.3
## 4 Bill Clinton 1946-08-19 1993 2001 8 69.2
mean(prez.former.live$Age)
## [1] 80.25
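Dividing days by 365 ignores leap years, so it runs slightly high for people in their nineties. The sketch below compares it with dividing by 365.25. The reference date `as.of` is an assumption made for the example (the date this report was actually run is not recorded), so the numbers will not match the table above exactly.

```r
# Dates of birth from the table above.
DOB.demo <- as.Date(c("1924-06-12", "1924-10-01", "1946-07-06", "1946-08-19"))

# Assumed reference date, used instead of Sys.Date() so the result is reproducible.
as.of <- as.Date("2015-10-15")

days.alive <- as.numeric(as.of - DOB.demo)
age.365 <- days.alive / 365       # the calculation used in this write-up
age.36525 <- days.alive / 365.25  # allows for one leap day every four years

round(cbind(age.365, age.36525), 1)
```

The difference is small (well under a tenth of a year per decade of life) but can change the first decimal place.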
With the list of all US presidents we can now review Youqing Xiang’s list of questions and produce some more interesting results.
This is a good example of having to pin down what a question means. Here is the list of presidents that served less than 4 years. Half were elected for a term but did not serve the full 4 years; the other half were vice presidents who finished a predecessor’s term and were never elected president in their own right. Gerald Ford was not even elected vice president; he succeeded Nixon when Nixon resigned and then lost the next election. We have 10 presidents that served less than 4 years.
prez.all$Start <- as.Date(prez.all$Start, format = "%B %d, %Y")
prez.all$End <- as.Date(prez.all$End, format = "%B %d, %Y")
prez.all <- mutate(prez.all, Days = as.numeric(End - Start), Years = round(Days / 365, 1))
arrange(subset(prez.all, Years < 4, select = c(-State, -Party)), Years)
## President Born Died Start End Days Years
## 1 William Henry Harrison 1773 1841 1841-03-04 1841-04-04 31 0.1
## 2 James A. Garfield 1831 1881 1881-03-04 1881-09-19 199 0.5
## 3 Zachary Taylor 1784 1850 1849-03-04 1850-07-09 492 1.3
## 4 Warren G. Harding 1865 1923 1921-03-04 1923-08-02 881 2.4
## 5 Gerald Ford 1913 2006 1974-08-09 1977-01-20 895 2.5
## 6 Millard Fillmore 1800 1874 1850-07-09 1853-03-04 969 2.7
## 7 John F. Kennedy 1917 1963 1961-01-20 1963-11-22 1036 2.8
## 8 Chester A. Arthur 1829 1886 1881-09-19 1885-03-04 1262 3.5
## 9 John Tyler 1790 1862 1841-04-04 1845-03-04 1430 3.9
## 10 Andrew Johnson 1808 1875 1865-04-15 1869-03-04 1419 3.9
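The date conversion and day count can be tried on two rows in isolation. This is a sketch with the Start and End strings copied from the table above; `term.dates` is a hypothetical name. Note that `%B` matches full month names in the current locale, so this assumes an English (or C) locale.

```r
# Two terms as scraped: the shortest and the longest.
term.dates <- data.frame(
  President = c("William Henry Harrison", "Franklin D. Roosevelt"),
  Start = c("March 4, 1841", "March 4, 1933"),
  End = c("April 4, 1841", "April 12, 1945"),
  stringsAsFactors = FALSE
)

# %B = full month name, %d = day of month, %Y = 4-digit year.
term.dates$Start <- as.Date(term.dates$Start, format = "%B %d, %Y")
term.dates$End <- as.Date(term.dates$End, format = "%B %d, %Y")

# Subtracting Dates gives a difftime in days; as.numeric() drops the unit.
term.dates$Days <- as.numeric(term.dates$End - term.dates$Start)
term.dates
```

If the locale does not use English month names, `as.Date` returns `NA` rather than raising an error, so it is worth checking for `NA` after converting.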
Here is the list of presidents that served 4 years. We get 14 presidents that served their 4 years. That makes (10 + 14) 24 presidents who were either elected to 1 term or finished someone else’s term, or just over half (about 55%) of all presidents.
arrange(subset(prez.all, Years == 4, select = c(-State, -Party)), Years)
## President Born Died Start End Days Years
## 1 John Adams 1735 1826 1797-03-04 1801-03-04 1460 4
## 2 John Quincy Adams 1767 1848 1825-03-04 1829-03-04 1461 4
## 3 Martin Van Buren 1782 1862 1837-03-04 1841-03-04 1461 4
## 4 James K. Polk 1795 1849 1845-03-04 1849-03-04 1461 4
## 5 Franklin Pierce 1804 1869 1853-03-04 1857-03-04 1461 4
## 6 James Buchanan 1791 1868 1857-03-04 1861-03-04 1461 4
## 7 Rutherford B. Hayes 1822 1893 1877-03-04 1881-03-04 1461 4
## 8 Grover Cleveland 1837 1908 1885-03-04 1889-03-04 1461 4
## 9 Benjamin Harrison 1833 1901 1889-03-04 1893-03-04 1461 4
## 10 Grover Cleveland 1837 1908 1893-03-04 1897-03-04 1461 4
## 11 William Howard Taft 1857 1930 1909-03-04 1913-03-04 1461 4
## 12 Herbert Hoover 1874 1964 1929-03-04 1933-03-04 1461 4
## 13 Jimmy Carter 1924 <NA> 1977-01-20 1981-01-20 1461 4
## 14 George H. W. Bush 1924 <NA> 1989-01-20 1993-01-20 1461 4
We get a similar situation when we look at two-term presidents (or more terms in 1 case). How do we count them? Abraham Lincoln was assassinated months after being elected to a second term, as was William McKinley. Lyndon Johnson, Calvin Coolidge, Theodore Roosevelt and Harry Truman finished terms for presidents that died in office, then got (re-) elected. Nixon had to resign office after being reelected. Here is the list of presidents that served more than 4 years, but less than 8. George Washington shows up because technically he served just under 8 years, but they were still working out the details. We get 8 that were elected to a second term (even if they were not elected to the first), but did not serve 8 years in office.
arrange(subset(prez.all, (Years > 4 & Years < 8), select = c(-State, -Party)), Years)
## President Born Died Start End Days Years
## 1 Abraham Lincoln 1809 1865 1861-03-04 1865-04-15 1503 4.1
## 2 William McKinley 1843 1901 1897-03-04 1901-09-14 1654 4.5
## 3 Lyndon B. Johnson 1908 1973 1963-11-22 1969-01-20 1886 5.2
## 4 Calvin Coolidge 1872 1933 1923-08-02 1929-03-04 2041 5.6
## 5 Richard Nixon 1913 1994 1969-01-20 1974-08-09 2027 5.6
## 6 Theodore Roosevelt 1858 1919 1901-09-14 1909-03-04 2728 7.5
## 7 George Washington 1732 1799 1789-04-30 1797-03-04 2865 7.8
## 8 Harry S. Truman 1884 1972 1945-04-12 1953-01-20 2840 7.8
How many presidents served 8 years or more? This gives us 12 presidents: 11 of these served two regular terms, and the 12th was Franklin D. Roosevelt, who died in office a few months after winning election to a 4th term. After FDR, presidents were limited to two terms.
arrange(subset(prez.all, Years >= 8, select = c(-State, -Party)), Years)
## President Born Died Start End Days Years
## 1 Thomas Jefferson 1743 1826 1801-03-04 1809-03-04 2922 8.0
## 2 James Madison 1751 1836 1809-03-04 1817-03-04 2922 8.0
## 3 James Monroe 1758 1831 1817-03-04 1825-03-04 2922 8.0
## 4 Andrew Jackson 1767 1845 1829-03-04 1837-03-04 2922 8.0
## 5 Ulysses S. Grant 1822 1885 1869-03-04 1877-03-04 2922 8.0
## 6 Woodrow Wilson 1856 1924 1913-03-04 1921-03-04 2922 8.0
## 7 Dwight D. Eisenhower 1890 1969 1953-01-20 1961-01-20 2922 8.0
## 8 Ronald Reagan 1911 2004 1981-01-20 1989-01-20 2922 8.0
## 9 Bill Clinton 1946 <NA> 1993-01-20 2001-01-20 2922 8.0
## 10 George W. Bush 1946 <NA> 2001-01-20 2009-01-20 2922 8.0
## 11 Barack Obama 1961 <NA> 2009-01-20 2017-01-20 2922 8.0
## 12 Franklin D. Roosevelt 1882 1945 1933-03-04 1945-04-12 4422 12.1
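The four groups used above (under 4 years, 4 years, between 4 and 8, and 8 or more) can also be assigned in one pass with `cut`. A sketch on seven day counts taken from the tables above; the break points sit halfway between the one-decimal values so that rounded years fall cleanly on one side.

```r
# Day counts copied from the tables above (Harrison through FDR).
days.demo <- c(31, 1460, 1461, 1503, 2865, 2922, 4422)
years.demo <- round(days.demo / 365, 1)

# Breaks at x.x5 avoid exact floating-point comparisons against 4 and 8.
bucket <- cut(years.demo,
              breaks = c(-Inf, 3.95, 4.05, 7.95, Inf),
              labels = c("under 4", "4", "over 4, under 8", "8 or more"))

table(bucket)
```

Applied to the full Years column, `table(bucket)` would give all four counts at once instead of one `subset` call per group.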
We had 24 presidents that in some way can be called 1-term presidents and we had 20 that were 2-term or more. Here is the full list sorted by days in office that shows the range of 31 to 4,422 days in office. The average number of days in office is 1,890 or just over 5 years.
# arrange(prez.all, Days, End)
arrange(subset(prez.all, Days > 0, select = c(-State, -Party)), Days, End)
## President Born Died Start End Days Years
## 1 William Henry Harrison 1773 1841 1841-03-04 1841-04-04 31 0.1
## 2 James A. Garfield 1831 1881 1881-03-04 1881-09-19 199 0.5
## 3 Zachary Taylor 1784 1850 1849-03-04 1850-07-09 492 1.3
## 4 Warren G. Harding 1865 1923 1921-03-04 1923-08-02 881 2.4
## 5 Gerald Ford 1913 2006 1974-08-09 1977-01-20 895 2.5
## 6 Millard Fillmore 1800 1874 1850-07-09 1853-03-04 969 2.7
## 7 John F. Kennedy 1917 1963 1961-01-20 1963-11-22 1036 2.8
## 8 Chester A. Arthur 1829 1886 1881-09-19 1885-03-04 1262 3.5
## 9 Andrew Johnson 1808 1875 1865-04-15 1869-03-04 1419 3.9
## 10 John Tyler 1790 1862 1841-04-04 1845-03-04 1430 3.9
## 11 John Adams 1735 1826 1797-03-04 1801-03-04 1460 4.0
## 12 John Quincy Adams 1767 1848 1825-03-04 1829-03-04 1461 4.0
## 13 Martin Van Buren 1782 1862 1837-03-04 1841-03-04 1461 4.0
## 14 James K. Polk 1795 1849 1845-03-04 1849-03-04 1461 4.0
## 15 Franklin Pierce 1804 1869 1853-03-04 1857-03-04 1461 4.0
## 16 James Buchanan 1791 1868 1857-03-04 1861-03-04 1461 4.0
## 17 Rutherford B. Hayes 1822 1893 1877-03-04 1881-03-04 1461 4.0
## 18 Grover Cleveland 1837 1908 1885-03-04 1889-03-04 1461 4.0
## 19 Benjamin Harrison 1833 1901 1889-03-04 1893-03-04 1461 4.0
## 20 Grover Cleveland 1837 1908 1893-03-04 1897-03-04 1461 4.0
## 21 William Howard Taft 1857 1930 1909-03-04 1913-03-04 1461 4.0
## 22 Herbert Hoover 1874 1964 1929-03-04 1933-03-04 1461 4.0
## 23 Jimmy Carter 1924 <NA> 1977-01-20 1981-01-20 1461 4.0
## 24 George H. W. Bush 1924 <NA> 1989-01-20 1993-01-20 1461 4.0
## 25 Abraham Lincoln 1809 1865 1861-03-04 1865-04-15 1503 4.1
## 26 William McKinley 1843 1901 1897-03-04 1901-09-14 1654 4.5
## 27 Lyndon B. Johnson 1908 1973 1963-11-22 1969-01-20 1886 5.2
## 28 Richard Nixon 1913 1994 1969-01-20 1974-08-09 2027 5.6
## 29 Calvin Coolidge 1872 1933 1923-08-02 1929-03-04 2041 5.6
## 30 Theodore Roosevelt 1858 1919 1901-09-14 1909-03-04 2728 7.5
## 31 Harry S. Truman 1884 1972 1945-04-12 1953-01-20 2840 7.8
## 32 George Washington 1732 1799 1789-04-30 1797-03-04 2865 7.8
## 33 Thomas Jefferson 1743 1826 1801-03-04 1809-03-04 2922 8.0
## 34 James Madison 1751 1836 1809-03-04 1817-03-04 2922 8.0
## 35 James Monroe 1758 1831 1817-03-04 1825-03-04 2922 8.0
## 36 Andrew Jackson 1767 1845 1829-03-04 1837-03-04 2922 8.0
## 37 Ulysses S. Grant 1822 1885 1869-03-04 1877-03-04 2922 8.0
## 38 Woodrow Wilson 1856 1924 1913-03-04 1921-03-04 2922 8.0
## 39 Dwight D. Eisenhower 1890 1969 1953-01-20 1961-01-20 2922 8.0
## 40 Ronald Reagan 1911 2004 1981-01-20 1989-01-20 2922 8.0
## 41 Bill Clinton 1946 <NA> 1993-01-20 2001-01-20 2922 8.0
## 42 George W. Bush 1946 <NA> 2001-01-20 2009-01-20 2922 8.0
## 43 Barack Obama 1961 <NA> 2009-01-20 2017-01-20 2922 8.0
## 44 Franklin D. Roosevelt 1882 1945 1933-03-04 1945-04-12 4422 12.1
mean(prez.all$Days)
## [1] 1890.341
mean(prez.all$Days) / 365.25
## [1] 5.175471
The oldest living president from our table above is George H. W. Bush, who is about 3 months older than Jimmy Carter. To see which presidents lived the longest, I show ages in years below (Gerald Ford and Ronald Reagan both died at 93). I also show each president’s age at the end of his term to show how old they were in office (the oldest in office was Reagan at 78) and the average age was 71.
prez.all$Died[is.na(prez.all$Died)] <- "2015" # to get an age for living presidents
prez.all <- mutate(prez.all, Age = as.numeric(Died) - as.numeric(Born))
# arrange(prez.all, desc(Age), End)
arrange(subset(prez.all, Days > 0, select = c(-State, -Party)), desc(Age), End)
## President Born Died Start End Days Years Age
## 1 Gerald Ford 1913 2006 1974-08-09 1977-01-20 895 2.5 93
## 2 Ronald Reagan 1911 2004 1981-01-20 1989-01-20 2922 8.0 93
## 3 John Adams 1735 1826 1797-03-04 1801-03-04 1460 4.0 91
## 4 Jimmy Carter 1924 2015 1977-01-20 1981-01-20 1461 4.0 91
## 5 George H. W. Bush 1924 2015 1989-01-20 1993-01-20 1461 4.0 91
## 6 Herbert Hoover 1874 1964 1929-03-04 1933-03-04 1461 4.0 90
## 7 Harry S. Truman 1884 1972 1945-04-12 1953-01-20 2840 7.8 88
## 8 James Madison 1751 1836 1809-03-04 1817-03-04 2922 8.0 85
## 9 Thomas Jefferson 1743 1826 1801-03-04 1809-03-04 2922 8.0 83
## 10 John Quincy Adams 1767 1848 1825-03-04 1829-03-04 1461 4.0 81
## 11 Richard Nixon 1913 1994 1969-01-20 1974-08-09 2027 5.6 81
## 12 Martin Van Buren 1782 1862 1837-03-04 1841-03-04 1461 4.0 80
## 13 Dwight D. Eisenhower 1890 1969 1953-01-20 1961-01-20 2922 8.0 79
## 14 Andrew Jackson 1767 1845 1829-03-04 1837-03-04 2922 8.0 78
## 15 James Buchanan 1791 1868 1857-03-04 1861-03-04 1461 4.0 77
## 16 Millard Fillmore 1800 1874 1850-07-09 1853-03-04 969 2.7 74
## 17 James Monroe 1758 1831 1817-03-04 1825-03-04 2922 8.0 73
## 18 William Howard Taft 1857 1930 1909-03-04 1913-03-04 1461 4.0 73
## 19 John Tyler 1790 1862 1841-04-04 1845-03-04 1430 3.9 72
## 20 Rutherford B. Hayes 1822 1893 1877-03-04 1881-03-04 1461 4.0 71
## 21 Grover Cleveland 1837 1908 1885-03-04 1889-03-04 1461 4.0 71
## 22 Grover Cleveland 1837 1908 1893-03-04 1897-03-04 1461 4.0 71
## 23 Bill Clinton 1946 2015 1993-01-20 2001-01-20 2922 8.0 69
## 24 George W. Bush 1946 2015 2001-01-20 2009-01-20 2922 8.0 69
## 25 William Henry Harrison 1773 1841 1841-03-04 1841-04-04 31 0.1 68
## 26 Benjamin Harrison 1833 1901 1889-03-04 1893-03-04 1461 4.0 68
## 27 Woodrow Wilson 1856 1924 1913-03-04 1921-03-04 2922 8.0 68
## 28 George Washington 1732 1799 1789-04-30 1797-03-04 2865 7.8 67
## 29 Andrew Johnson 1808 1875 1865-04-15 1869-03-04 1419 3.9 67
## 30 Zachary Taylor 1784 1850 1849-03-04 1850-07-09 492 1.3 66
## 31 Franklin Pierce 1804 1869 1853-03-04 1857-03-04 1461 4.0 65
## 32 Lyndon B. Johnson 1908 1973 1963-11-22 1969-01-20 1886 5.2 65
## 33 Ulysses S. Grant 1822 1885 1869-03-04 1877-03-04 2922 8.0 63
## 34 Franklin D. Roosevelt 1882 1945 1933-03-04 1945-04-12 4422 12.1 63
## 35 Theodore Roosevelt 1858 1919 1901-09-14 1909-03-04 2728 7.5 61
## 36 Calvin Coolidge 1872 1933 1923-08-02 1929-03-04 2041 5.6 61
## 37 William McKinley 1843 1901 1897-03-04 1901-09-14 1654 4.5 58
## 38 Warren G. Harding 1865 1923 1921-03-04 1923-08-02 881 2.4 58
## 39 Chester A. Arthur 1829 1886 1881-09-19 1885-03-04 1262 3.5 57
## 40 Abraham Lincoln 1809 1865 1861-03-04 1865-04-15 1503 4.1 56
## 41 James K. Polk 1795 1849 1845-03-04 1849-03-04 1461 4.0 54
## 42 Barack Obama 1961 2015 2009-01-20 2017-01-20 2922 8.0 54
## 43 James A. Garfield 1831 1881 1881-03-04 1881-09-19 199 0.5 50
## 44 John F. Kennedy 1917 1963 1961-01-20 1963-11-22 1036 2.8 46
library(lubridate) # provides ymd() and year()
prez.all <- mutate(prez.all, InOffice = as.numeric(year(ymd(End))) - as.numeric(Born))
# arrange(prez.all, desc(InOffice))
arrange(subset(prez.all, Days > 0, select = c(-State, -Party, -Days, -Years)), desc(InOffice))
## President Born Died Start End Age InOffice
## 1 Ronald Reagan 1911 2004 1981-01-20 1989-01-20 93 78
## 2 Dwight D. Eisenhower 1890 1969 1953-01-20 1961-01-20 79 71
## 3 Andrew Jackson 1767 1845 1829-03-04 1837-03-04 78 70
## 4 James Buchanan 1791 1868 1857-03-04 1861-03-04 77 70
## 5 Harry S. Truman 1884 1972 1945-04-12 1953-01-20 88 69
## 6 George H. W. Bush 1924 2015 1989-01-20 1993-01-20 91 69
## 7 William Henry Harrison 1773 1841 1841-03-04 1841-04-04 68 68
## 8 James Monroe 1758 1831 1817-03-04 1825-03-04 73 67
## 9 John Adams 1735 1826 1797-03-04 1801-03-04 91 66
## 10 Thomas Jefferson 1743 1826 1801-03-04 1809-03-04 83 66
## 11 James Madison 1751 1836 1809-03-04 1817-03-04 85 66
## 12 Zachary Taylor 1784 1850 1849-03-04 1850-07-09 66 66
## 13 George Washington 1732 1799 1789-04-30 1797-03-04 67 65
## 14 Woodrow Wilson 1856 1924 1913-03-04 1921-03-04 68 65
## 15 Gerald Ford 1913 2006 1974-08-09 1977-01-20 93 64
## 16 Franklin D. Roosevelt 1882 1945 1933-03-04 1945-04-12 63 63
## 17 George W. Bush 1946 2015 2001-01-20 2009-01-20 69 63
## 18 John Quincy Adams 1767 1848 1825-03-04 1829-03-04 81 62
## 19 Andrew Johnson 1808 1875 1865-04-15 1869-03-04 67 61
## 20 Lyndon B. Johnson 1908 1973 1963-11-22 1969-01-20 65 61
## 21 Richard Nixon 1913 1994 1969-01-20 1974-08-09 81 61
## 22 Benjamin Harrison 1833 1901 1889-03-04 1893-03-04 68 60
## 23 Grover Cleveland 1837 1908 1893-03-04 1897-03-04 71 60
## 24 Martin Van Buren 1782 1862 1837-03-04 1841-03-04 80 59
## 25 Rutherford B. Hayes 1822 1893 1877-03-04 1881-03-04 71 59
## 26 Herbert Hoover 1874 1964 1929-03-04 1933-03-04 90 59
## 27 William McKinley 1843 1901 1897-03-04 1901-09-14 58 58
## 28 Warren G. Harding 1865 1923 1921-03-04 1923-08-02 58 58
## 29 Calvin Coolidge 1872 1933 1923-08-02 1929-03-04 61 57
## 30 Jimmy Carter 1924 2015 1977-01-20 1981-01-20 91 57
## 31 Abraham Lincoln 1809 1865 1861-03-04 1865-04-15 56 56
## 32 Chester A. Arthur 1829 1886 1881-09-19 1885-03-04 57 56
## 33 William Howard Taft 1857 1930 1909-03-04 1913-03-04 73 56
## 34 Barack Obama 1961 2015 2009-01-20 2017-01-20 54 56
## 35 John Tyler 1790 1862 1841-04-04 1845-03-04 72 55
## 36 Ulysses S. Grant 1822 1885 1869-03-04 1877-03-04 63 55
## 37 Bill Clinton 1946 2015 1993-01-20 2001-01-20 69 55
## 38 James K. Polk 1795 1849 1845-03-04 1849-03-04 54 54
## 39 Millard Fillmore 1800 1874 1850-07-09 1853-03-04 74 53
## 40 Franklin Pierce 1804 1869 1853-03-04 1857-03-04 65 53
## 41 Grover Cleveland 1837 1908 1885-03-04 1889-03-04 71 52
## 42 Theodore Roosevelt 1858 1919 1901-09-14 1909-03-04 61 51
## 43 James A. Garfield 1831 1881 1881-03-04 1881-09-19 50 50
## 44 John F. Kennedy 1917 1963 1961-01-20 1963-11-22 46 46
round(mean(prez.all$Age),0)
## [1] 71
To wrap up, I used the count function from dplyr to summarize the data. It answers the old questions of which state (NY) and which party (Republican) had the most presidents.
count(prez.all, State, sort = TRUE)
## Source: local data frame [17 x 2]
##
## State n
## (chr) (int)
## 1 New York 8
## 2 Ohio 6
## 3 Virginia 5
## 4 Massachusetts 4
## 5 California 3
## 6 Illinois 3
## 7 Tennessee 3
## 8 Texas 3
## 9 Arkansas 1
## 10 Georgia 1
## 11 Indiana 1
## 12 Louisiana 1
## 13 Michigan 1
## 14 Missouri 1
## 15 New Hampshire 1
## 16 New Jersey 1
## 17 Pennsylvania 1
count(prez.all, Party, sort = TRUE)
## Source: local data frame [6 x 2]
##
## Party n
## (chr) (int)
## 1 Republican 18
## 2 Democratic 16
## 3 Democratic-Republican 4
## 4 Whig 4
## 5 Federalist 1
## 6 Independent 1
I got the idea to examine Amazon reviews, and the Banana Slicer specifically, from classmate Joy Peyton. I believe scraping text from the web, and reviews in particular, may be something I will need to do. I started by reading data in from Amazon and getting it into a dataframe, and thought I would focus on sentiment analysis.
This is very similar to the work on the presidents. I looked at the table information to see if I could read things in almost directly, as with the living presidents’ data. That did not seem possible; however, the data selectors were clearly marked and I was able to get a tag for each field or column I wanted. I chose:
amazon.html <- read_html("http://www.amazon.com/Hutzler-571-Banana-Slicer/product-reviews/B0047E0EII/ref=cm_cr_dp_see_all_btm?ie=UTF8&showViewpoints=1&sortBy=bySubmissionDateDescending", encoding = "UTF-8")
title <- html_nodes(amazon.html, ".a-color-base.a-text-bold")
title <- html_text(title)
author <- html_nodes(amazon.html, ".author")
author <- html_text(author)
date <- html_nodes(amazon.html, "#cm_cr-review_list .review-date")
date <- html_text(date)
date <- as.Date(mdy(gsub("on ", "", date)))
rate <- html_nodes(amazon.html, "#cm_cr-review_list .review-rating")
rate <- html_text(rate)
rate <- as.numeric(gsub(" out of 5 stars", "", rate))
review <- html_nodes(amazon.html, ".review-data+ .review-data")
review <- html_text(review)
reviews <- data.frame(author, title, rate, date, review, stringsAsFactors = FALSE)
reviews
## author
## 1 Sarah G.
## 2 Brandon Braud
## 3 Amazon Customer
## 4 M. Underhill
## 5 R. Steen
## 6 Janice Konstantinidis
## 7 onlinegirl
## 8 paula
## 9 Chameleon
## 10 Fritz Finley
## title rate date
## 1 Independence slicer! 5 2015-10-12
## 2 surprised by elegant efficacy. 5 2015-10-12
## 3 Life changing 5 2015-10-10
## 4 Five Stars 5 2015-10-10
## 5 Five Stars 5 2015-10-09
## 6 In eternal wonderment of the Hutzler 521 Banana Slicer 5 2015-10-08
## 7 Just don't go in the water with it! 3 2015-10-08
## 8 Do Not Buy 2 2015-10-05
## 9 Disappointed... 1 2015-10-05
## 10 Five Stars 5 2015-10-02
## review
## 1 My twins LOVE making their own banana snack. We started using this when they were about 18 months and have had no problems or worries that they would cut themselves. 1 year olds + the banana slicer = independence = one happy momma!!
## 2 It work for turds also. Don't put slice turd on sereal tho.
## 3 I used to think people who bought this are weirdos. But i wanted to see what the heck the hype was about so I got one for myself. Let me tell you, OH MY GOD. It changed my way of life forever! who has time to get a knife when all you want is a sliced banana? Worry no more, this is everything you ever needed in life! Easy to use and clean, you'll have your slice banana in no time! I always keep this amazing device in my bag since I never know when I feel like eating sliced bananas. Plain whole bananas just won't do! They have to be sliced by this device!
## 4 They've done studies, you know. 60% of the time it works every time.
## 5 Game changer. Everything you knew about slicing bananas has been turned on its head.
## 6 I was given this Banana Slicer as a gift. I am nothing short of amazed at how it's changed my life. I can't think of one aspect of my existence that this product has not enhanced. I woke up this morning with a new joy de vivre, raison de'etre. My consciousness flooded with the knowledge that I have the Banana Slicer. I immediately began to meditate on it. I felt a new awareness and appreciation of all things. Cutting my morning banana was an artform. Later in the day I began to visualize it as I went about my daily chores. No longer daunted by the mediocrity of usual daily imperatives, I delved into my inner self for answers I'd been seeking, and voila I was rewarded. I had found the meaning of my life. I had found my truth, my metaphysical self. My all. Thank you Teri. Elizabeth Barrett Browning, 1806 - 1861 How do I love thee? Let me count the ways.I love thee to the depth and breadth and heightMy soul can reach, when feeling out of sightFor the ends of being and ideal grace.I love thee to the level of every day’sMost quiet need, by sun and candle-light.I love thee freely, as men strive for right.I love thee purely, as they turn from praise.I love thee with the passion put to useIn my old griefs, and with my childhood’s faith.I love thee with a love I seemed to loseWith my lost saints. I love thee with the breath,Smiles, tears, of all my life; and, if God choose,I shall but love thee better after death.
## 7 Uh...You're going to need a bigger banana.
## 8 not sharp enough and it smushed the banana as it "cuts" through it. Not worth any price.
## 9 In the description for the Hutzler 571 Banana Slicer, it says "Great for cereal".But I found that it's not great for cereal at all! Not only did it cut my tongue, but it was hard to chew and didn't taste very good. What were they thinking? I'm going back to oatmeal.
## 10 Banana Fun!
We can see from above that we were able to read most data straight from the web page with the rvest function html_text. Before I describe the minimal tidying needed to get this web-page-wide information into a fairly narrow dataframe, I want to discuss a problem that I feel is more about Retrieving.
There are over 5,000 reviews of the Hutzler 571 Banana Slicer, but I only seem to be able to access them 10 at a time. I spent some time trying to figure this out and ran out of time. Maybe a web developer can give me some ideas. This 10 at a time problem would greatly hinder the usefulness of using R to analyze web-based reviews.
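One way around the 10-at-a-time limit might be to request each page of reviews in turn and stack the results. The sketch below is only that. It assumes the site accepts a pageNumber query parameter on the review URL, which is an assumption and not something confirmed here. The helper names review_page_urls and scrape_review_page and the example.com URL are made up for illustration, and the CSS selectors are the ones used above. A site may also throttle or block automated requests, so the actual download step is left commented out.

```r
# Build one URL per review page.
# ASSUMPTION: the site accepts a "pageNumber" query parameter.
review_page_urls <- function(base_url, pages) {
  sep <- if (grepl("?", base_url, fixed = TRUE)) "&" else "?"
  paste0(base_url, sep, "pageNumber=", pages)
}

# Scrape one page into a data frame (hypothetical helper that reuses
# two of the selectors from the code above).
scrape_review_page <- function(url) {
  page <- rvest::read_html(url, encoding = "UTF-8")
  data.frame(
    author = rvest::html_text(rvest::html_nodes(page, ".author")),
    title  = rvest::html_text(rvest::html_nodes(page, ".a-color-base.a-text-bold")),
    stringsAsFactors = FALSE
  )
}

# Placeholder base URL, not the real product page.
urls <- review_page_urls(
  "http://www.example.com/product-reviews/B0047E0EII?sortBy=recent", 1:3
)

# Not run here because it needs network access:
# all_reviews <- do.call(rbind, lapply(urls, function(u) {
#   Sys.sleep(2)  # pause between requests
#   scrape_review_page(u)
# }))
```

If the parameter works as assumed, all_reviews would hold one row per review across the requested pages.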
Back to tidying the data. Each date was captured with a leading “on”, as in “on March 3, 2011”. I used gsub to remove the “on” and the space after it, then mdy from the lubridate package in combination with as.Date to get the dates ready for a dataframe.
date <- as.Date(mdy(gsub("on ", "", date)))
In a similar way I used gsub again to strip " out of 5 stars" from each rating, and as.numeric to convert the remaining character string into a number we can average.
rate <- as.numeric(gsub(" out of 5 stars", "", rate))
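As a small worked example of these two cleanup steps, the sketch below runs them on a few sample strings. The sample values are assumptions about what the scraped text looks like, not actual scraped output. It uses sub with an anchored pattern, a slightly stricter variant of the gsub calls above, so that only a leading "on " is removed. The date conversion runs only if lubridate is installed.

```r
# Sample raw strings (assumed shapes, for illustration only).
raw_date <- c("on March 3, 2011", "on October 12, 2015")
raw_rate <- c("5.0 out of 5 stars", "3.0 out of 5 stars")

# Remove only a leading "on " so the rest of the date text is untouched.
clean_date <- sub("^on ", "", raw_date)

# Drop the fixed suffix, then convert the remaining text to a number.
clean_rate <- as.numeric(sub(" out of 5 stars", "", raw_rate, fixed = TRUE))

# Convert to Date with lubridate when it is available.
if (requireNamespace("lubridate", quietly = TRUE)) {
  parsed_date <- as.Date(lubridate::mdy(clean_date))
}
```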
I spent more time trying, to no avail, to make the review text field display better when the dataframe is printed. When I display a single column it looks fine, like one character string, but not so when I display the entire dataframe. tbl_df from dplyr did not help either.
tbl_df(reviews)
## Source: local data frame [10 x 5]
##
## author
## (chr)
## 1 Sarah G.
## 2 Brandon Braud
## 3 Amazon Customer
## 4 M. Underhill
## 5 R. Steen
## 6 Janice Konstantinidis
## 7 onlinegirl
## 8 paula
## 9 Chameleon
## 10 Fritz Finley
## Variables not shown: title (chr), rate (dbl), date (date), review (chr)
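One possible workaround for the wide review column is to print a copy of the dataframe in which each review is cut to a fixed number of characters. This is a minimal sketch in base R. The function truncate_reviews and the reviews_demo data are hypothetical stand-ins, and the full text stays untouched in the original dataframe.

```r
# Hypothetical stand-in for the scraped data frame.
reviews_demo <- data.frame(
  author = c("A", "B"),
  rate   = c(5, 2),
  review = c("Short review.",
             paste(rep("A very long review that keeps going.", 10), collapse = " ")),
  stringsAsFactors = FALSE
)

# Return a copy with each review cut to at most `max_chars` characters;
# truncated entries end in "..." so they are easy to spot.
truncate_reviews <- function(df, max_chars = 40) {
  out <- df
  too_long <- nchar(out$review) > max_chars
  out$review[too_long] <- paste0(substr(out$review[too_long], 1, max_chars - 3), "...")
  out
}

print(truncate_reviews(reviews_demo))
```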
What really got me interested in this example was Joy’s idea on sentiment analysis and the concept of a satire detector. I have done a simple sentiment analysis in Python before and I hoped to leverage that work.
My sketched out ideas amount to this.
I ran out of time before I got the dictionary working well enough to test a function. My limited data set would have made it difficult to test for satire.
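For reference, a dictionary-based score of the kind I had in mind could look something like the sketch below. It is not a tested method: the two word lists are tiny and made up for illustration, the function name sentiment_score is hypothetical, and a real analysis would use a published sentiment lexicon.

```r
# Tiny, made-up word lists (assumptions for illustration only).
positive_words <- c("love", "amazing", "happy", "great", "easy")
negative_words <- c("disappointed", "hard", "not", "problems", "worry")

# Score = (number of positive words) - (number of negative words),
# counted on the lower-cased text split at non-letter characters.
sentiment_score <- function(text) {
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  as.numeric(sum(words %in% positive_words) - sum(words %in% negative_words))
}

# Two short sample sentences adapted from the reviews above.
demo <- c("My twins LOVE it, one happy momma!",
          "Disappointed... it was hard to chew.")
scores <- vapply(demo, sentiment_score, numeric(1), USE.NAMES = FALSE)
print(scores)
```

A satire check could then compare each text score with its star rating, but with only 10 reviews that comparison would not mean much.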
I can give you the average rating from my 10 reviews …
round(mean(reviews$rate),1)
## [1] 4.1