Allow me to introduce myself

Greetings, I am an R Markdown document. R Markdown is a simple formatting syntax for authoring HTML, PDF, and Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com. RStudio also has cheatsheets that you can access by going to Help > Cheatsheets. This means you never have to memorize any of the commands you see here (yay!). Besides, you’ll eventually remember them if you use them often enough.

To transform me into a Word or HTML file, click the Knit button, and I will generate a file that includes both the content and the output of any embedded R code chunks within the document. In this document, the output in the YAML is html_document, so I will transform into a web page. Don’t click Knit just yet, though: you’ll have to fill in some missing lines of code below or R will throw a tantrum.

One advantage of R Markdown is that you can embed code chunks like so:

# This is a code chunk
# R expects R code here
# But since I placed a hashtag to the left of this sentence
# R will ignore everything I type to the right of the hashtag
# This is why I can type human readable sentences here
summary(cars) # show a summary of the variables in the dataset cars

##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

How does R know it’s a code chunk? Well, we start with three backticks (that’s the key to the left of the 1 key on your keyboard; the one we use to type tilde ~) and a set of braces { } that contain the letter r. This indicates that what follows will be R code and not text. When we are done with our code, we must close the chunk with another three backticks.

See the green button that looks like a play button inside the code chunk? If you click on that, R Markdown will execute the code in the console and display the results right here. Try it! Cool huh? :) The editor shades all code chunks so they are easily identified.

Including plots

We can use code chunks to embed plots in our final document. For example:

As we work in the R Markdown document, all we see is the code. If we want to see what the code will create, we… yes, click the green play button. When we knit the document, R will execute the code in the code chunk and embed the plot automatically in the final document (which could be a Word, PDF, or HTML document). In this document, we have asked for a web page, so when we click the Knit button, we will get a web page.

The echo = FALSE parameter added to the previous code chunk prevents printing of the R code that generated the plot in the knitted file.

You might wonder: What’s the big deal? Well, imagine you are working on a project where you’ll generate a few graphs and write a report based on those graphs. You’ll probably use Word to type stuff, switch to Excel to create your plots, then copy-and-paste (or embed) those plots from Excel back into Word. That’s way. too. much. work.

R Markdown allows you to do all of that in one program. You type your words, then use a code chunk to tell R to create a plot, and continue writing. This is not only more efficient, it reduces human error from copying-and-pasting. More importantly, since everything is in one file (including the code to create the plots), it is reproducible. That is, because you are typing code instead of using the mouse, you have a complete record of what you did. Not only that, you can send someone else your R Markdown file, and they can recreate (reproduce) your report! This allows us to check our work and ensure quality control.

Still unconvinced? Consider the following scenario. You wrote up a report in Word, and copied-and-pasted plots from Excel into your Word report. Now your boss says, “We have some updated data. Can you re-create those plots?” Now you’ll have to open Excel again, use the mouse to create your plots with the new data, and copy-and-paste the updated plots back into Word.

In R Markdown, all you need to do is change the code you wrote in a code chunk to reflect the new dataset, click Knit, and voilà! An updated report with the new plots!

Where knitted files go

All knitted files (Word, HTML, PDF) go in the same folder as the R Markdown file. For instance, if your R Markdown file is in:

C:/Users/your_user_name/Downloads/my_first_text_analysis.Rmd

then the knitted Word (or HTML) file will go in that same Downloads folder.
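
If you’d like to work out that location yourself, here is a minimal sketch in base R. It assumes the knitted file keeps the same file name and only swaps the extension (that part is my assumption, not something stated above), and the names rmd_file, output_folder, file_stem, and html_file are made up for illustration.

```r
# Path to the R Markdown file from the example above
rmd_file <- "C:/Users/your_user_name/Downloads/my_first_text_analysis.Rmd"

# The folder that holds the R Markdown file
output_folder <- dirname(rmd_file)

# Assumption: the knitted file keeps the same name, with .html instead of .Rmd
file_stem <- tools::file_path_sans_ext(basename(rmd_file))
html_file <- file.path(output_folder, paste0(file_stem, ".html"))

print(output_folder)
print(html_file)
```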

Analysing Jane Austen

Prelude

The power of R lies in its global community of users who author and share code to perform pretty much any task you can imagine. These bundles of shared code are called packages, and as long as you have internet access, you can download them to your machine.

To install a package, we use the install.packages("name_of_package_we_want_to_install") command. After we’ve installed a package, we use the library(name_of_package_we_want_to_load) command to load the package.

This process is similar to what we do with our phones in the Play or App Store. If we want an app, we first have to install it. This is like install.packages(""). After we’ve installed the app, we have to open it in order to use it. This is like library().

We only have to install a package once, but we have to load it each time we want to use it. This is exactly the same as our phones, no?

Let’s now install the following packages:

tidyverse
tidytext
janeaustenr

install.packages("tidyverse") # installs the tidyverse package
install.packages("tidytext") # installs the tidytext package
install.packages("janeaustenr") # installs the janeaustenr package

# If we want to install them all at one go:
# install.packages(c("tidyverse", "tidytext", "janeaustenr"))

Notice how I use comments to describe what the line of code will do. R ignores everything to the right of the # symbol, so be sure to use that to document what your code does. This way, you will remember what you did, and if you do decide to share your code with others, they will know what you did.
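
Since a package only needs to be installed once, it can be handy to check what is already on your machine before installing again. Here is one possible sketch, not the only way to do it. The helper missing_packages() is a name I made up; requireNamespace() is a base R function that returns TRUE when a package is installed. The install line is left commented out so nothing gets downloaded by accident.

```r
# Hypothetical helper: returns the packages in pkgs that are not installed yet
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("tidyverse", "tidytext", "janeaustenr"))

if (length(to_install) > 0) {
  message("Not installed yet: ", paste(to_install, collapse = ", "))
  # install.packages(to_install) # uncomment to install the missing packages
} else {
  message("All three packages are already installed.")
}
```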

What do we have to do after we’ve installed the packages and want to use them? Type the code in the code chunk below. I’ve loaded the first package; you do the remaining two.

library(tidyverse) # loads the tidyverse package
library(tidytext) # loads the tidytext package
library(janeaustenr) # loads the janeaustenr package

Let’s Get to Work!

The janeaustenr package contains the text of Jane Austen’s six completed and published novels in a tidy format. We get this dataset by calling the function austen_books(). Let’s group by each novel, create a column called line that numbers each line of text within its novel, and assign the result to an object called original_books.

original_books <- austen_books() %>% # take austen_books, *then*
  group_by(book) %>% # group by each book, *then*
  mutate(line = row_number()) %>% # create a new variable called line, *then*
  ungroup() # ungroup the books
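
To see why we group before numbering, here is a small sketch on made-up data (toy_books is not part of janeaustenr, and the text is invented). Because of group_by(), row_number() restarts at 1 for each book.

```r
library(dplyr)

# Two made-up "books" with a few lines of text each
toy_books <- data.frame(
  book = c("A", "A", "A", "B", "B"),
  text = c("first line", "second line", "third line", "first line", "second line"),
  stringsAsFactors = FALSE
)

toy_numbered <- toy_books %>%
  group_by(book) %>% # number the lines separately for each book
  mutate(line = row_number()) %>% # 1, 2, 3 for book A, then 1, 2 for book B
  ungroup()

print(toy_numbered)
```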

Next, let’s restructure original_books into a one-token-per-row format and assign the result to an object called tidy_austen. Do you remember how to unnest tokens?

tidy_austen <- original_books %>%
  unnest_tokens(word, text) # split the text column into one word per row
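
If you want to see what unnest_tokens() does on something smaller, here is a sketch with two short made-up lines (toy_lines is my own example, not the real dataset). By default, unnest_tokens() also lowercases the words and strips punctuation.

```r
library(dplyr)
library(tidytext)

# Two short lines of text, one row per line
toy_lines <- data.frame(
  line = 1:2,
  text = c("It is a truth universally acknowledged,", "that a single man"),
  stringsAsFactors = FALSE
)

toy_tokens <- toy_lines %>%
  unnest_tokens(word, text) # the text column is replaced by a word column

print(toy_tokens) # 10 rows: 6 words from line 1, 4 words from line 2
```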

Now we remove stopwords like so:

tidy_austen <- tidy_austen %>%
  anti_join(get_stopwords()) # get stopwords and remove them from tidy_austen

## Joining, by = "word"
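
What is anti_join() doing here? It keeps only the rows of the first table that have no match in the second. Here is a sketch with a tiny hand-made stopword list (toy_stopwords is made up for illustration; the real list comes from get_stopwords()).

```r
library(dplyr)

toy_tokens <- data.frame(
  word = c("it", "is", "a", "truth"),
  stringsAsFactors = FALSE
)

# A hypothetical, very short stopword list
toy_stopwords <- data.frame(
  word = c("it", "is", "a"),
  stringsAsFactors = FALSE
)

toy_clean <- toy_tokens %>%
  anti_join(toy_stopwords, by = "word") # drop every word found in toy_stopwords

print(toy_clean) # only "truth" is left
```

Naming the column with by = "word" is optional, but it silences the "Joining" message.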

Let’s plot!

Let’s do a bar graph to visualise Jane Austen’s most common words across all six books. To keep the graph informative, let’s look only at words that Austen used more than 1,500 times.

There are three missing pieces of code you have to complete in the code chunk below!

tidy_austen %>%
  count(word, sort = TRUE) %>% # count the number of words and sort them by frequency
  filter(n > 1500) %>% # keep only words that are used more than 1,500 times
  mutate(word = reorder(word, n)) %>% # reorder the words by count so the bars are sorted by frequency
  ggplot() + # plot function
  aes(x = word , y = n) + # word on the x-axis, count (n) on the y-axis
  geom_col() + # we want to plot *col*umns
  coord_flip() # let's flip the axes
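
The first three steps of that pipeline are easier to follow on a handful of made-up words (toy_words below is invented, not Austen’s text). This sketch shows what count(), filter(), and reorder() each contribute before anything is plotted.

```r
library(dplyr)

# Six made-up tokens, one per row
toy_words <- data.frame(
  word = c("emma", "emma", "emma", "darcy", "darcy", "anne"),
  stringsAsFactors = FALSE
)

toy_counts <- toy_words %>%
  count(word, sort = TRUE) %>% # emma 3, darcy 2, anne 1
  filter(n > 1) %>% # drops anne, which appears only once
  mutate(word = reorder(word, n)) # word becomes a factor ordered by n

print(toy_counts)
print(levels(toy_counts$word)) # least frequent first: "darcy" "emma"
```

ggplot draws factor levels in order, which is why the reordered words come out as neatly sorted bars.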

What about the most common words in Pride and Prejudice?

tidy_austen %>%
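  filter(book == "Pride & Prejudice") %>% # keep only the rows from Pride & Prejudice
  count(word, sort = TRUE) %>% # count each word and sort by frequency
  filter(n > 250) %>% # keep only words used more than 250 times
  mutate(word = reorder(word, n)) %>% # reorder the words by count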
  ggplot() + # plot function
  aes(x = word, y = n) + # word on the x-axis, count on the y-axis
  geom_col() + # we want to plot columns
  coord_flip() # let's flip the axes
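
Notice that the filter needs the title exactly as it is stored in the data, ampersand and all. If you’re unsure how a title is spelled, one way to check (a quick sketch, assuming janeaustenr is installed) is to list the levels of the book column.

```r
library(janeaustenr)

# The book column is a factor, so its levels are the exact titles to filter on
book_titles <- levels(austen_books()$book)
print(book_titles)
```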

Analysing sentiments

Finally, let’s see how sentiment changes across each novel. To do this, we will use the Bing sentiment lexicon (no relation to the search engine) to label each word as positive or negative. Next, we will count the number of positive and negative words in 80-line sections of each novel and subtract one count from the other.

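austen_sentiment <- tidy_austen %>%
  inner_join(get_sentiments("bing"), by = "word") %>% # keep only words found in the Bing lexicon
  count(book, index = line %/% 80, sentiment) %>% # count positive and negative words in each 80-line section
  spread(sentiment, n, fill = 0) %>% # one column each for the negative and positive counts
  mutate(sentiment = positive - negative) # net sentiment for each section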
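
The line %/% 80 part is integer division: it drops the remainder, so lines 0 to 79 land in section 0, lines 80 to 159 in section 1, and so on. Here is a sketch on six made-up words that have already been labelled (toy_labelled is invented; no real lexicon is involved).

```r
library(dplyr)
library(tidyr)

# Made-up words, each with the line it came from and a sentiment label
toy_labelled <- data.frame(
  line = c(1, 5, 79, 80, 81, 150),
  sentiment = c("positive", "negative", "positive",
                "negative", "negative", "positive"),
  stringsAsFactors = FALSE
)

toy_sentiment <- toy_labelled %>%
  count(index = line %/% 80, sentiment) %>% # lines 1, 5, 79 -> 0; lines 80, 81, 150 -> 1
  spread(sentiment, n, fill = 0) %>% # columns: index, negative, positive
  mutate(sentiment = positive - negative) # section 0: 2 - 1 = 1; section 1: 1 - 2 = -1

print(toy_sentiment)
```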

Now, let’s plot these sentiment scores!

austen_sentiment %>%
  ggplot() + 
  aes(x = index, y = sentiment, fill = book) + 
  geom_col(show.legend = FALSE) +
  facet_wrap(~ book, scales = "free_x")

And there you have it, a quick introduction to text analysis in R! We’ve only just scratched the surface and there’s a lot more you can do. If you’re interested, check out the following free and open-source resources:

Text Mining with R. This tutorial draws heavily from the first chapter of this book.

R for Data Science

Data Visualization

Let’s end with something cool!

Before you can run the code below, you’ll need to install the package leaflet. Remember how to do that? Type that in the console and hit ENTER.

Let’s explore South Orange!

library(leaflet)
leaflet() %>%
  setView(lng = -74.251, lat = 40.748, zoom = 15) %>% 
  addTiles() %>%
  addMarkers(lng = -74.247, lat = 40.743, popup = "Seton Hall")

Go ahead and interact with the map! Don’t worry, R is not tracking you… or is it? No, it isn’t. Really. :P