Introduction

Hello everyone! This document was put together as an introduction to data visualization for the 6th Quantitative Workshop from CC Bio Insites. It’s ended up being a bit of a beast, so I recommend breaking it up into sections or just focusing on the elements that seem interesting or have what you need.

We put together a worksheet to prepare you for the workshop! All of the code can be copy/pasted into R and run using the ‘data_visualization_merged.csv’ file included in that folder.

If you’re new to R or want to brush up on data cleaning and the like, check out the documents from an earlier workshop:

Remember to be patient with yourself when working in R and give yourself breaks when it gets frustrating! We’ll have time for Q&A during the workshop, and you can reach out to me with questions regarding this code or anything else R/stats related: hlperkin@purdue.edu. Enjoy!


Load Libraries and Data

The section below is used to load the required libraries and import the data. Nothing exciting happening here, just getting things ready for the rest of the code.

# load libraries ----------------------------------------------------------

library(ggplot2) #needed for all but the most basic plots
library(tidyverse) #used to reformat data from wide to long
library(Rmisc) #used to help create some of the plots
library("tm") #used to get text data into R for wordcloud
library(stringr) #used to clean text data
library(readr) #used to import text data (read_lines)
library("wordcloud") #used to create the wordcloud
library(RColorBrewer) #used to make the wordcloud pretty

# import data -------------------------------------------------------------

data <- read.csv(file="Data/final_merged.csv", header = T) #make sure you edit this line to point to wherever your data lives

# remove columns with pre/post gender information since recoded variable is available
data <- subset(data, select=c(1,3,4,6,7,8))
head(data)
##   id salg_pre  vbs_pre salg_post vbs_post gender_rc
## 1  1    2.750 3.866667     2.625 3.800000         F
## 2  2    2.375 3.533333     2.250 3.133333         F
## 3  3    2.250 3.866667     3.625 3.933333         F
## 4  4    4.250 4.066667     4.250 3.933333         F
## 5  5    3.750 3.600000     3.500 3.933333         F
## 6  6    4.250 3.266667     5.000 4.333333         F

Histograms

Histograms are good for presenting information but they’re even better when importing, cleaning, and checking your data. They can be an easy way to spot whether something has gone wrong when coding your data, and are an important step when checking for univariate normality (e.g., skewness and kurtosis).

Basic Histogram

The simplest histogram possible. This is usually the version I use when checking my data, as it’s quick and simple.

hist(data$salg_pre) #remember, you can highlight the command and hit F1 to view help file, which details the various arguments you can use

hist(data$salg_pre, breaks = 4) #for instance, here I've used the 'breaks' argument to set the number of bins, which changes the plot significantly
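
If you’re curious what the ‘breaks’ argument is actually doing, here’s a minimal sketch using simulated scores (the ‘scores’ vector is made up, not the workshop data). Setting plot = FALSE tells hist() to hand back the bin information instead of drawing anything. Note that R treats a single number for ‘breaks’ as a suggestion, so the number of bins you get may not match exactly.

```r
set.seed(42)                                      # make the fake data reproducible
scores <- round(runif(100, min = 1, max = 5), 2)  # simulated 1-5 scale scores (not the workshop data)

h_default <- hist(scores, plot = FALSE)              # default binning
h_few     <- hist(scores, breaks = 4, plot = FALSE)  # ask for roughly 4 bins

h_default$breaks   # where the default bin edges fall
h_few$breaks       # where the edges fall with breaks = 4
h_few$counts       # how many scores landed in each bin
```

Comparing the two sets of bin edges is a quick way to see why the same data can look so different from one histogram to the next.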

Between-Groups Comparison

Another simple visualization that can be very effective. If your analysis involves comparing groups (e.g., men and women’s scores on a scale measuring science interest and confidence) then I recommend creating a plot like this before running any inferential statistics.

data_mf <- subset(data, gender_rc == "M" | gender_rc == "F") #create a new dataframe, limited to students who responded 'male' or 'female' to a question about their gender. this drops anyone who left the response blank (n = 11) or identified as transgender or non-binary (n = 1)

ggplot(data = data_mf, aes(x=salg_pre, fill=gender_rc)) +
  geom_histogram(alpha=0.6, position = 'identity', bins = 10) +
  scale_fill_manual(values=c("#69b3a2", "#404080")) +
  labs(fill="Gender", caption = "Figure 1: Pre-test science interest and confidence by gender") + xlab("Science Interest & Confidence") + ylab("Frequency")
## Warning: Removed 25 rows containing non-finite values (stat_bin).

This is the first time we’ve used the ggplot() command, which plays a big role in creating visualizations in R. I’m not an expert, but I have a working understanding so I’ll try to provide some background information. The first thing to note is that people use lots of different ways to call the ggplot command, and many people use the %>% command (called a pipe). I’ve tried not to use it here, to avoid introducing extra confusion, but if you’re like me and steal code shamelessly from the internet you’ll definitely come across it. The link I provided above has some useful information, and Google is as always your friend.

As for the code I have written, here’s a quick breakdown of some of the arguments:

  • data - This argument is pretty straightforward; it tells ggplot() the name of your dataframe.
  • aes() - This argument points ggplot() at your data and tells it what’s what. ‘x’ is the x-axis and ‘y’ is the y-axis (lowercase, since R is case-sensitive). ‘fill’ is one of many arguments you can use to indicate additional variables (in this case a between-subjects variable that we want to compare across). The y-axis is left unspecified, which is okay when creating a histogram, where the y-axis is always the frequency.
  • ‘+’ - You might not have noticed this little guy tagging along at the end of the line, but he’s actually very important. He tells R to run the current line and the next as one command. Try highlighting just the ggplot() command (ignore the +) and run it on its own to see what happens, and then highlight the ggplot() and geom_histogram() commands together to see how the output changes.
  • geom_histogram() - The first command gets ggplot() pointed in the right direction. This command tells it to start making plots. There are lots of ‘geom’ commands – we’ll use some different ones later. The arguments in this command tell ggplot() to make the output a little bit see-through (alpha), whether or not to stack the data (position), and how many bins to use (bins). You can tweak these settings to see what happens, and can always use F1 on the geom_histogram() command for more information and additional arguments.
  • scale_fill_manual() - This command builds on the previous one and sets custom colors. If you skip this line, the same plot is output, but it uses ggplot’s default red and blue. This is useful if you want to make your plots in black and white, or need to control the palette so it’s more accessible, or just want to change the colors.
  • labs(), xlab(), ylab() - ‘Lab’ in this case is shorthand for label. These commands allow you to customize your plot’s titles, axes, legend, etc. Pretty straightforward.
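
To see how the pieces in the list above stack up, here’s a rough sketch that builds the same kind of plot one layer at a time. The ‘fake’ dataframe and the object names (p_base, p_hist, p_full) are made up for illustration; swap in your own data and variable names.

```r
library(ggplot2)

# made-up data standing in for the workshop file
set.seed(1)
fake <- data.frame(
  score = c(rnorm(60, mean = 3.1, sd = 0.8), rnorm(30, mean = 3.6, sd = 0.6)),
  group = rep(c("F", "M"), times = c(60, 30))
)

p_base <- ggplot(data = fake, aes(x = score, fill = group))      # sets up the data and axes only
p_hist <- p_base +
  geom_histogram(alpha = 0.6, position = "identity", bins = 10)  # adds the bars
p_full <- p_hist +
  scale_fill_manual(values = c("#69b3a2", "#404080")) +          # custom colors
  labs(fill = "Group") + xlab("Score") + ylab("Frequency")       # labels

p_base   # empty canvas with an x-axis
p_hist   # histogram with default colors
p_full   # histogram with custom colors and labels
```

Running the three objects one after another shows what each ‘+’ contributes to the final plot.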

Last but not least – if you run this code on this data, it will produce a warning message (“Removed 25 rows containing non-finite values (stat_bin)”). In RStudio, warning messages (and error messages) will output in red in your console. Warning messages are just R giving you a heads-up about something that happened in the background. In this case, this incomprehensible message means that it removed 25 cases with NAs. Error messages indicate that something went wrong, and are more serious.

Sometimes it’s not clear whether something is a warning or an error, as R gives the most useless error and warning messages known to man. This is another occasion where I recommend a healthy relationship with Google. People unfamiliar with R sometimes stress out over error and warning messages, especially when they’re full of useless gibberish, but after a while they just become annoying. Don’t be afraid to Google or reach out to people to troubleshoot – the scary red text doesn’t mean you’ve done something wrong, it just means that R is a dumb robot and needs some TLC.
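
If you want R itself to tell you whether a command produced a warning or an error, the base R tryCatch() command can help. This is only a sketch; ‘check’ is a made-up helper name, not something from a package.

```r
# a warning: R finishes the job but flags something odd
x <- suppressWarnings(as.numeric(c("1", "2", "oops")))  # "oops" can't be a number, so it becomes NA
x

# tryCatch() lets you see which kind of message a command produced
check <- function(expr) {
  tryCatch({
    expr              # run whatever was passed in
    "ran cleanly"     # only reached if nothing was signalled
  },
  warning = function(w) paste("warning:", conditionMessage(w)),
  error   = function(e) paste("error:", conditionMessage(e)))
}

check(as.numeric("oops"))   # produces a warning
check(log("oops"))          # produces an error
check(sqrt(4))              # produces neither
```

A warning means the command finished (here with an NA in the result), while an error means it stopped and returned nothing.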

Density Plots

Our ggplot() histogram is prettier than the basic histogram, but it’s still quite blocky and subject to the main issue with histograms: bin size. You can play with the code and see what happens to the plot if you specify a small number of bins (like 2) or a bunch (like 50). Changing the bin size in a histogram can radically alter how your distribution is presented. So if you really need to examine your distribution visually, density plots are a better option.

Basic Density Plot

As with the histogram, our first example is the most basic. Note that this code snippet uses stacked or nested commands.

plot(density(na.omit(data$salg_pre)))

Using Stacked/Nested Commands

There are three commands here: na.omit, density, and plot:

  • na.omit() - Removes entries with missing data
  • density() - Creates the density estimate
  • plot() - Outputs the density estimate as a plot.

Each command acts on the output produced by the command nested within it. Another, longer way to do the same thing would look like this:

data_nona <- na.omit(data$salg_pre)
densityest <- density(data_nona)
plot(densityest)

This time, each command uses the arrow command (‘<-’) to put the output in an object, which the next command then refers to. Sometimes this is a good way to do things, sometimes it’s not. For me, when I’m creating basic density plots, it’s early in my analysis when I’m looking at my variables and checking for issues or just getting familiar with the data. It doesn’t make sense to clutter up my workspace with a bunch of objects that I’m not going to use again, so I just use the long nested command that doesn’t output a bunch of extra stuff.
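
If you want to convince yourself that the nested and step-by-step versions really do the same thing, here’s a quick check using simulated scores (the ‘x’ vector is made up and includes two missing values on purpose):

```r
set.seed(7)
x <- c(rnorm(50, mean = 3.5, sd = 0.7), NA, NA)   # simulated scores with two missing values

# nested version: one line, no leftover objects
nested <- density(na.omit(x))

# stepwise version: same result, but an extra object in the workspace
x_nona   <- na.omit(x)
stepwise <- density(x_nona)

all.equal(nested$y, stepwise$y)   # the two estimates match
length(x_nona)                    # 50 values survive na.omit()
```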

Saving Plots as R Objects

You may have noticed at this point that the plot I created isn’t saved in an object. You can save a ggplot() plot as an object very easily using the arrow command referenced above. This can be useful if your plot takes a long time to run (very rare) or if you want to refer to it again in a later command (more common). It’s not something I do often in this document but it’s useful to keep the option in mind. Also note that saving a ggplot() plot as an object will cause it not to appear in your Plots window in RStudio; you’ll have to call the object to get it to display. (Base R’s plot() command is different: it draws right away and returns NULL, so there’s nothing useful to save.) For example:

mynewplot <- ggplot(data, aes(x = salg_pre)) + geom_density()
mynewplot

The first line by itself will create the plot but not show it to you. The second will access the saved information and draw the visualization. If you end up with a bunch of objects that you want to get rid of, you can use the rm() command:

rm(data_nona, densityest, mynewplot)

If you want to get your objects back, just re-run your code.

Between-Groups Comparison

We can also use density plots to compare two distributions of the same variable. The code below is very similar to what we used to create the histogram – the only thing that’s changed is switching geom_histogram() to geom_density(). Since density plots don’t use bins, the ‘bins’ argument has also gone away (R will give you a warning if you leave it in, but the command will still run).

ggplot(aes(x=salg_pre, fill=gender_rc), data = data_mf) +
  geom_density(alpha=0.6, position = 'identity') +
  scale_fill_manual(values=c("#69b3a2", "#404080")) +
  labs(fill="Gender", caption = "Figure 1: Pre-test science interest and confidence by gender") + xlab("Science Interest & Confidence") + ylab("Density")
## Warning: Removed 25 rows containing non-finite values (stat_density).

You can compare the histograms and density plots to see how they visualize the same information in different ways. Some things, such as the shape of the distribution, are easier to see with a density plot. However, the difference in sample sizes is hidden by the density plot, which may be misleading in certain situations.
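
Here’s a small sketch of why the sample size disappears in a density plot. The two groups are simulated, and ‘area_under’ is a made-up helper that approximates the area under each curve.

```r
set.seed(3)
big_group   <- rnorm(57, mean = 3.1, sd = 0.8)   # simulated larger group
small_group <- rnorm(26, mean = 3.6, sd = 0.6)   # simulated smaller group

# approximate area under a density curve: sum of (height x step width)
area_under <- function(d) sum(d$y) * diff(d$x[1:2])

area_under(density(big_group))     # close to 1
area_under(density(small_group))   # also close to 1, despite fewer cases

# a histogram keeps the raw counts, so the size difference stays visible
sum(hist(big_group, plot = FALSE)$counts)
sum(hist(small_group, plot = FALSE)$counts)
```

Because every density curve is scaled to cover about the same area, a group of 26 looks just as ‘big’ as a group of 57.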

Group Comparison Barplots

Technically we’ve already created some barplots with our histograms above. Barplots are generally recommended for use with frequency data, but sometimes they get used to illustrate comparisons between groups.

Wide Format to Long

If you’re creating a non-histogram barplot with ggplot, you’ll need to make sure your data is in long format. Most of the time we use wide format, in which the columns indicate variables (both within- and between-subjects) and the rows indicate cases or participants. Long format is often used to collapse a wide dataframe into three columns: identifiers, variable names, and variable values. You can see an example using the code below:

head(data) #preview our original dataframe, using wide format
##   id salg_pre  vbs_pre salg_post vbs_post gender_rc
## 1  1    2.750 3.866667     2.625 3.800000         F
## 2  2    2.375 3.533333     2.250 3.133333         F
## 3  3    2.250 3.866667     3.625 3.933333         F
## 4  4    4.250 4.066667     4.250 3.933333         F
## 5  5    3.750 3.600000     3.500 3.933333         F
## 6  6    4.250 3.266667     5.000 4.333333         F
data_long <- gather(data, variable, value, salg_pre:vbs_post, factor_key = T) #using the gather() command to switch from wide to long
head(data_long) #preview our new long dataframe
##   id gender_rc variable value
## 1  1         F salg_pre 2.750
## 2  2         F salg_pre 2.375
## 3  3         F salg_pre 2.250
## 4  4         F salg_pre 4.250
## 5  5         F salg_pre 3.750
## 6  6         F salg_pre 4.250
head(data_long[order(data_long$id, decreasing=TRUE), ]) #preview sorted by id, so we can see that the same participant appears across multiple rows, with unique values for each variable
##      id gender_rc  variable    value
## 120 120         F  salg_pre 3.250000
## 240 120         F   vbs_pre 3.000000
## 360 120         F salg_post       NA
## 480 120         F  vbs_post       NA
## 119 119         F  salg_pre 2.500000
## 239 119         F   vbs_pre 3.866667

The above example collapsed the within-subjects variables (salg and vbs pre and post) and kept the between-subjects variable in its own column. The ggplot() command doesn’t always require this organization, it’s just a habit of mine from who knows where and it generally serves me well.
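
The tidyr documentation now recommends pivot_longer() in place of gather(), so you’ll likely run into it in newer code. Here’s a sketch of the same wide-to-long switch using a tiny made-up dataframe (‘wide’ and its values are invented, but the column layout matches the workshop data):

```r
library(tidyr)

# tiny made-up wide dataframe with the same column layout as the workshop data
wide <- data.frame(
  id        = 1:3,
  salg_pre  = c(2.75, 2.375, 2.25),
  vbs_pre   = c(3.87, 3.53, 3.87),
  salg_post = c(2.625, 2.25, 3.625),
  vbs_post  = c(3.80, 3.13, 3.93),
  gender_rc = c("F", "F", "M")
)

long <- pivot_longer(wide,
                     cols = salg_pre:vbs_post,   # the columns to collapse
                     names_to = "variable",      # new column holding the old column names
                     values_to = "value")        # new column holding the scores

long   # 3 participants x 4 variables = 12 rows
```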

Basic Group Comparison Barplot

A lot of this code will look very familiar. A few things are done differently than the last time we used ggplot():

  • We now have a y-axis variable specified (‘value’).
  • We use the summarySE() command (a helpful command from the Rmisc package) to create summary information for our data (e.g., means, standard deviations/errors, and confidence intervals). Because we’re using the table created by the summarySE() command, we refer to the ‘summ’ object instead of our dataframe.
  • A new R object called ‘barcaption’ is created. This object consists of a string wrapped in the str_wrap() command, which tells R to insert a linebreak in the string after every 50 characters (it’s smart enough to not chop words). This is because the labs() command isn’t smart, and will let the caption run off the side of the image and get cut off. If you’re viewing this in the Rpubs document, it probably doesn’t make much of a difference; if you’re running the code yourself in RStudio, the small plots window means that part of your caption gets cut off. Depending on how large your plot will be, you may or may not need to use the str_wrap() command.

data_long_mf <- subset(data_long, gender_rc == "M" | gender_rc == "F") #create a new dataframe from our long format data that's limited to male and female students
summ <- summarySE(data_long_mf, measurevar="value", groupvars = c("variable","gender_rc"), na.rm = T) #the summarySE() command creates a table with the information we need for our plot
barcaption <- str_wrap("Figure 2: Pre- and post-test science interest/confidence and views and barriers towards schooling", 50) #creates our wrapped caption

ggplot(summ, aes(x=variable, y=value, fill=gender_rc)) + 
  geom_bar(position=position_dodge(), stat="identity") +
  labs(caption = barcaption) + xlab("Variable") + ylab("Mean")
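
To see what wrapping a caption actually does to the string, here’s a sketch using strwrap(), the base R cousin of str_wrap(). The two commands aren’t identical (strwrap() returns one piece per line rather than a single string), so treat this as an illustration of the idea.

```r
caption <- "Figure 2: Pre- and post-test science interest/confidence and views and barriers towards schooling"

# strwrap() splits the string into pieces that each fit within the width
pieces <- strwrap(caption, width = 50)
pieces
nchar(pieces)   # length of each piece

# gluing the pieces together with "\n" gives a single string with line breaks
wrapped <- paste(pieces, collapse = "\n")
cat(wrapped)
```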

Group Comparison Barplot with Error Bars

This code makes a simple addition: it adds the geom_errorbar() command. This command uses the ‘ymin’ and ‘ymax’ arguments and some math to figure out the floor and ceiling values for the error bars, and then draws them using the rest of the arguments. Everything else stays the same.

ggplot(summ, aes(x=variable, y=value, fill=gender_rc)) + 
  geom_bar(position=position_dodge(), stat="identity") +
  geom_errorbar(aes(ymin=value-se, ymax=value+se),
                width=.2,
                position=position_dodge(.9)) +
  labs(caption = barcaption) + xlab("Variable") + ylab("Mean")
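
The ‘se’ values behind the error bars come from summarySE(). As far as I can tell it uses the usual formulas, so here’s a sketch of the same math by hand with simulated scores (the ‘scores’ vector is made up). The last two lines plug in the N and sd from the first row of the summary table printed later in this document, so you can compare the results with the ‘se’ and ‘ci’ columns listed there.

```r
set.seed(11)
scores <- rnorm(52, mean = 3.6, sd = 0.8)   # simulated scores for one group

n  <- length(scores)
m  <- mean(scores)
s  <- sd(scores)
se <- s / sqrt(n)                    # standard error of the mean
ci <- se * qt(0.975, df = n - 1)     # half-width of a 95% confidence interval

c(N = n, mean = m, sd = s, se = se, ci = ci)

# the error bars in the plot run from mean - se to mean + se
c(lower = m - se, upper = m + se)

# same formulas applied to N = 52 and sd = 0.7884314 from the summary table
table_se <- 0.7884314 / sqrt(52)
table_ci <- table_se * qt(0.975, df = 51)
c(se = table_se, ci = table_ci)
```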

Boxplots

In some cases, a boxplot is a better option than a barplot for showing a comparison between groups. It still allows for the comparison of central tendency (by default, the line in the box is the median, not the mean), but provides more information about the distribution of your data. The tricky thing about boxplots is the assumptions that many people make when viewing them, which lead to misinterpretations and misunderstandings. The image below breaks down the elements of the default boxplot (image is stolen from a site that talks about a non-R program, but might be a useful read).

All of these elements can be customized in R, but we’ll focus on creating the basic version and then tweaking the box itself.
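
Since the elements of a boxplot are easy to misread, here’s a small sketch with made-up numbers. It uses the base R boxplot.stats() command to print the values that a default base R boxplot draws. Note that geom_boxplot() may calculate the edges of the box slightly differently, so treat this as a guide to the idea rather than an exact match.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)   # made-up values with one extreme score

bs <- boxplot.stats(x)
bs$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out     # points drawn individually as outliers

median(x)  # the line in the middle of the box
mean(x)    # not shown on a default boxplot, and pulled upward by the extreme score
```

The whiskers stop at the most extreme values within 1.5 times the box height of the box edges; anything beyond that is drawn as a separate point.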

Basic Boxplot

Note that boxplots also use long format data.

boxplot(data_long_mf$value ~ data_long_mf$variable, ylab="Value" , xlab="Variable")

Between-Groups Comparison

Again, much of this ggplot code is identical to what we’ve already used. Note that the ‘caption’ argument uses the barcaption object that we created earlier.

ggplot(data = data_long_mf, aes(x=variable, y=value, fill=gender_rc)) +
  geom_boxplot() +
  labs(caption = barcaption) + xlab("Variable") + ylab("Value")
## Warning: Removed 118 rows containing non-finite values (stat_boxplot).

This is an alternative version that uses confidence intervals instead of quartiles to draw the boxes. It requires a bit of code to make work, but if you want to customize it for your own data you can simply replace ‘gender_rc’ with the name of your own between-subjects variable. This boxplot presents a lot of the same information from the between-groups barplot above, but gives a little bit more information about the distributions.

summ2 <- merge(summ, merge(
  setNames(aggregate(x = data_long_mf$value, by = list(data_long_mf$variable, data_long_mf$gender_rc), FUN = "max", na.rm = T), c("variable","gender_rc","max")),
  setNames(aggregate(x = data_long_mf$value, by = list(data_long_mf$variable, data_long_mf$gender_rc), FUN = "min", na.rm = T), c("variable","gender_rc","min"))))
summ2
##    variable gender_rc  N    value        sd         se        ci      max
## 1 salg_post         F 52 3.627404 0.7884314 0.10933576 0.2195007 5.000000
## 2 salg_post         M 22 3.693182 0.7931320 0.16909632 0.3516550 5.000000
## 3  salg_pre         F 57 3.140351 0.7856377 0.10406026 0.2084578 4.875000
## 4  salg_pre         M 26 3.629808 0.6046495 0.11858151 0.2442232 4.625000
## 5  vbs_post         F 52 3.553846 0.6350048 0.08805932 0.1767865 4.666667
## 6  vbs_post         M 22 3.578788 0.5506653 0.11740225 0.2441513 5.000000
## 7   vbs_pre         F 57 3.617544 0.4419521 0.05853798 0.1172657 4.666667
## 8   vbs_pre         M 26 3.541026 0.4213531 0.08263415 0.1701882 4.266667
##        min
## 1 1.875000
## 2 2.500000
## 3 1.500000
## 4 2.500000
## 5 1.466667
## 6 2.333333
## 7 2.600000
## 8 2.400000
ggplot(data = summ2, aes(x=variable, y=value, fill=gender_rc)) +
  geom_boxplot(aes(lower=value-ci, upper=value+ci, middle=value, ymin=min, ymax=max), stat = "identity") +
  labs(caption = barcaption) + xlab("Variable") + ylab("Mean")

Scatterplots

Barplots and boxplots are good for showing group comparisons, but what if you want to show the relationship between a pair of variables? In this case, scatterplots are your best bet. Like histograms and density plots, scatterplots are also good to use during your analysis as a way to examine the relationships between variables visually and maybe spot any issues before they trip up your analyses.

Basic Scatterplot

As with the other plots, you can create functional scatterplots using R’s base code:

plot(x=data$salg_pre, y=data$salg_post)

You might have a list of variables whose correlations you want to examine (for instance, before running a regression). Rather than run the same command over and over, you can use the code below to produce a grid or matrix of plots:

pairs(data[,2:5], lower.panel = NULL) #the brackets after calling the 'data' dataframe tell the command which columns to use. specifying columns 2 and 4, i.e. data[,c(2,4)], would just show the pre/post salg data, for instance. the comma within the brackets seems random, but it separates rows from columns: whatever comes before it picks rows (blank means all rows) and whatever comes after it picks columns. yay, R.
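
Here’s a quick sketch of how the square brackets work, using a small made-up dataframe (‘df’ is invented but has the same column order as the workshop data). The last line prints the correlations that a scatterplot matrix shows visually.

```r
# small made-up dataframe with the same column order as the workshop data
df <- data.frame(
  id        = 1:5,
  salg_pre  = c(2.8, 2.4, 2.3, 4.3, 3.8),
  vbs_pre   = c(3.9, 3.5, 3.9, 4.1, 3.6),
  salg_post = c(2.6, 2.3, 3.6, 4.3, 3.5),
  vbs_post  = c(3.8, 3.1, 3.9, 3.9, 3.9)
)

df[, 2:5]        # all rows, columns 2 through 5
df[, c(2, 4)]    # all rows, just salg_pre and salg_post
df[1:3, 2:5]     # first three rows only

# the numbers behind a scatterplot matrix: pairwise correlations
round(cor(df[, 2:5], use = "pairwise.complete.obs"), 2)
```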

Between-Groups Comparison

But of course, you might also be interested in how the relationship between variables differs according to group membership. To do this, we’ll tweak the ggplot() code we’ve been using throughout the document.

ggplot(data_mf, aes(x=salg_pre, y=salg_post, color=gender_rc)) + 
  geom_point(size=3) +
  labs(caption = "Figure 1: Pre and post science interest and confidence", color = "Gender") + xlab("Pre-Test") + ylab("Post-Test")
## Warning: Removed 59 rows containing missing values (geom_point).

Note that there’s a new argument in the labs() command, ‘color’, which lets us customize our legend. If we had used a different argument in our ggplot() command, such as ‘fill’, we would replace the ‘color’ argument with fill in order to customize this section of our plot.

If you’ve created scatterplots before, you might have run into the issue of overplotting, or having overlapping points. This often happens when you don’t have a lot of unique data points. The sample data used for this document is the average of several variables, which helps prevent this issue, so here’s an example from this page:

You can solve this issue by using the geom_jitter() command in place of geom_point(). This command works identically to geom_point() but adds a bit of random noise (jitter) to each point’s position, so the points don’t land on top of each other:

ggplot(data_mf, aes(x=salg_pre, y=salg_post, color=gender_rc)) + 
  geom_jitter(size=3) +
  labs(caption = "Figure 1: Pre and post science interest and confidence", color = "Gender") + xlab("Pre-Test") + ylab("Post-Test")
## Warning: Removed 59 rows containing missing values (geom_point).

Or you can view the example from the linked page, which shows a more dramatic effect:

The page itself talks about different types of overplotting and presents some other possible solutions, such as using the ‘alpha’ argument or changing the size of your points. Check it out!

Note that using jitter has one major drawback, in that it misrepresents your data. A little bit of jitter can help readers make sense of your plot, but the amount of jitter is customizable using the ‘height’ and ‘width’ arguments in the geom_jitter() command. It’s possible to present a false narrative, even accidentally, by playing with these settings too much, so be cautious.
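
To get a feel for how much jitter moves your points, here’s a sketch using base R’s jitter() command as a stand-in. The ‘x’ vector is made up, and the ‘amount’ argument here plays a similar role to ‘width’ and ‘height’ in geom_jitter(), though the defaults differ between the two commands.

```r
set.seed(5)
x <- rep(c(1, 2, 3), each = 4)     # made-up scores with lots of ties
sum(duplicated(x))                 # 9 of the 12 values repeat an earlier one

small_jitter <- jitter(x, amount = 0.05)   # each point moves by at most 0.05
large_jitter <- jitter(x, amount = 0.75)   # each point can move by up to 0.75

max(abs(small_jitter - x))   # points stay close to their true values
max(abs(large_jitter - x))   # points can drift toward a neighboring score

# count how many points would now round to a different score
sum(round(large_jitter) != x)
```

With the larger setting, a reader can no longer tell which score a point originally had, which is exactly the kind of misrepresentation to watch out for.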

Word Clouds

Last but not least is qualitative data, for which the simplest visualization is the good, old-fashioned word cloud. There are some websites that will create word clouds for you, but it’s good to keep in mind that decisions are made when creating any visualization, even a word cloud, and it’s important to know what they are lest you misrepresent your data.

I don’t have any qualitative data that I can share for this example, so I’m going to use an ebook of Oliver Twist from Project Gutenberg. On a side note, my head knows that Oliver Twist is about an orphan from London, but in my heart Oliver Twist will always be this little guy:

Importing Text Data into R

Getting text into R is similar to importing quantitative data, but cleaning text data is naturally pretty different. I don’t go into too much detail for the example below, but if you’re looking for resources, a lot of this code is cribbed from a few helpful sites:

R isn’t well-suited to true qualitative analysis, but it can do some cool stuff at the word-level that complements a traditional thematic or interpretive analysis. And of course, you can create pretty pictures to offset all of that text.

In the code below, I import the text file (your qualitative data will need to be in the stripped down .txt format – copy/pasting your data into Notepad is the easiest, if not the most efficient, way to get it in the correct format) and do some data cleaning. I’ve tried to provide comments for the code to make it clear what’s going on.

text <- read_lines("Data/olivertwist.txt", skip = 141) #import the text file. remember to edit this path if you're using your own data. the 'skip' argument drops the first 141 lines from the file, which are just additions from Project Gutenberg and the list of chapters

# there are some weird text artifacts that have crept into the txt file and special characters, so before we do anything else, we're going to get rid of them. some of the other commands later will also replace text, but they don't work well for these little bits of gibberish, so I'm doing that separately.

gibberish <- c("â€\u009d|’s|“|’|â€|“|”|’|—") #each bit between the | is a set of characters that needs to be replaced. if I wanted to replace the words apple, banana, and carrot, I would write "apple|banana|carrot"
text2 <- str_replace_all(string = text, pattern = gibberish, replacement = "") #this does the replacement

docs <- Corpus(VectorSource(text2)) #creates the corpus. if you have multiple files -- e.g., interview data from multiple participants -- you can combine them into a single corpus while still keeping them separate, and even combine your text data with numeric or categorical data (e.g., demographic information)

inspect(head(docs, n = 20)) #lets us preview the corpus. note that I'm using nested commands here to limit the preview (otherwise it would print the entire oliver twist novel as output, which would take up a bit of space). if you want to see your full corpus or as much as R will show, just use the command 'inspect(docs)'
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 20
## 
##  [1]  CHAPTER I.                                                            
##  [2] TREATS OF THE PLACE WHERE OLIVER TWIST WAS BORN AND OF THE             
##  [3] CIRCUMSTANCES ATTENDING HIS BIRTH                                      
##  [4]                                                                        
##  [5]                                                                        
##  [6] Among other public buildings in a certain town, which for many reasons 
##  [7] it will be prudent to refrain from mentioning, and to which I will     
##  [8] assign no fictitious name, there is one anciently common to most towns,
##  [9] great or small: to wit, a workhouse; and in this workhouse was born; on
## [10] a day and date which I need not trouble myself to repeat, inasmuch as  
## [11] it can be of no possible consequence to the reader, in this stage of   
## [12] the business at all events; the item of mortality whose name is        
## [13] prefixed to the head of this chapter.                                  
## [14]                                                                        
## [15] For a long time after it was ushered into this world of sorrow and     
## [16] trouble, by the parish surgeon, it remained a matter of considerable   
## [17] doubt whether the child would survive to bear any name at all; in which
## [18] case it is somewhat more than probable that these memoirs would never  
## [19] have appeared; or, if they had, that being comprised within a couple of
## [20] pages, they would have possessed the inestimable merit of being the

If you’re using your own data, you probably won’t know beforehand whether there are any gibberish bits that need to be replaced. In that case, I would recommend skipping the creation of the gibberish object and the str_replace_all() command, and simply importing your text, creating the corpus, and previewing it. Then you can check the preview for any issues, go back and create the gibberish object and run the str_replace_all() command, re-create your corpus, and preview it again. You can iterate through this process until you’ve removed all the gibberish as needed.

Note that this command shouldn’t be used to remove punctuation, numbers, or other ‘real words’. That will happen in the code below in a much smoother way. This is just for those odd little bits of nonsense that creep in sometimes.
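
If the ‘|’ pattern is new to you, here’s a tiny sketch of how it behaves, using the base R gsub() command and a made-up ‘fruit_text’ vector. The str_replace_all() command takes the same kind of pattern.

```r
fruit_text <- c("an apple a day", "banana bread", "carrot cake and apple pie")

pattern <- "apple|banana|carrot"    # | means "or" in a regular expression
gsub(pattern, "", fruit_text)       # remove every match from each string

# characters with a special meaning in regular expressions (like .) need escaping
gsub("\\.", "", "e.g. this.")       # removes literal periods only
```

Notice that removing a word leaves its surrounding spaces behind, which is one reason the whitespace gets cleaned up in a later step.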

Basic Text Cleaning

This is where we get rid of the stuff we don’t want showing up in our wordcloud. This includes things like punctuation but also filler words. The tm_map() and content_transformer() commands make this process pretty quick.

Note that when you run the tm_map() command, R will spit out a warning message saying that it “drops documents”. In this case, documents refers to words and characters in your corpus, not any other kind of document. This is actually a good thing! We’re using the tm_map() command to remove stuff we want to remove! So this is a useless warning. Thanks, R.

docs <- tm_map(docs, content_transformer(tolower)) #this makes everything lowercase. otherwise the word 'if' and 'If' will be treated as separate entries
docs <- tm_map(docs, removeNumbers) #drops all numbers. if you want to keep numbers in your wordcloud, you should skip this line
docs <- tm_map(docs, removePunctuation) #removes all punctuation
docs <- tm_map(docs, stripWhitespace) #removes extra spaces, like the lines between paragraphs or double spaces after periods
inspect(head(docs, n = 20))
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 20
## 
##  [1]  chapter i                                                            
##  [2] treats of the place where oliver twist was born and of the            
##  [3] circumstances attending his birth                                     
##  [4]                                                                       
##  [5]                                                                       
##  [6] among other public buildings in a certain town which for many reasons 
##  [7] it will be prudent to refrain from mentioning and to which i will     
##  [8] assign no fictitious name there is one anciently common to most towns 
##  [9] great or small to wit a workhouse and in this workhouse was born on   
## [10] a day and date which i need not trouble myself to repeat inasmuch as  
## [11] it can be of no possible consequence to the reader in this stage of   
## [12] the business at all events the item of mortality whose name is        
## [13] prefixed to the head of this chapter                                  
## [14]                                                                       
## [15] for a long time after it was ushered into this world of sorrow and    
## [16] trouble by the parish surgeon it remained a matter of considerable    
## [17] doubt whether the child would survive to bear any name at all in which
## [18] case it is somewhat more than probable that these memoirs would never 
## [19] have appeared or if they had that being comprised within a couple of  
## [20] pages they would have possessed the inestimable merit of being the
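
If you want to see what each cleaning step does without building a corpus, here’s a rough base R sketch on a single made-up line. These gsub() patterns are my own approximations of the tm commands used above, not what tm runs internally.

```r
line <- "Oliver Twist, aged 9,  asked for MORE!"   # made-up example line

step1 <- tolower(line)                       # lowercase everything
step2 <- gsub("[[:digit:]]+", "", step1)     # drop numbers
step3 <- gsub("[[:punct:]]+", "", step2)     # drop punctuation
step4 <- gsub("[[:space:]]+", " ", step3)    # collapse runs of whitespace

step4
```

Printing step1 through step4 one at a time shows how the line changes after each command.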

Removing Stopwords

One of the central parts of cleaning text is the removal of stopwords. Stopwords are bits of language that aren’t meaningful in certain analyses. For instance, if you’re creating a wordcloud from interview data, you might find the words ‘you’ and ‘know’ are huge in the cloud because the participant uses the phrase ‘you know’ as a verbal tic. Stripping all occurrences of ‘you know’ from the corpus can thus ensure that your wordcloud is presenting meaningful words, phrases, and ideas, and not cluttered up with verbal filler.
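
Here’s a bare-bones sketch of the idea in base R. The ‘answer’ string and the ‘my_stopwords’ list are both made up for illustration; a real stopword list would be much longer.

```r
answer <- "you know i think the lab was you know really fun"   # made-up interview snippet

words <- strsplit(answer, " ")[[1]]                    # split the sentence into words
my_stopwords <- c("you", "know", "i", "the", "was")    # hypothetical stopword list

kept <- words[!words %in% my_stopwords]                # keep only words not on the list
kept

sort(table(kept), decreasing = TRUE)                   # word counts that would feed a wordcloud
```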

The tm package, which is what we’re using in this section, has a list of stopwords that you can use.

stopwords("english")
##   [1] "i"          "me"         "my"         "myself"     "we"        
##   [6] "our"        "ours"       "ourselves"  "you"        "your"      
##  [11] "yours"      "yourself"   "yourselves" "he"         "him"       
##  [16] "his"        "himself"    "she"        "her"        "hers"      
##  [21] "herself"    "it"         "its"        "itself"     "they"      
##  [26] "them"       "their"      "theirs"     "themselves" "what"      
##  [31] "which"      "who"        "whom"       "this"       "that"      
##  [36] "these"      "those"      "am"         "is"         "are"       
##  [41] "was"        "were"       "be"         "been"       "being"     
##  [46] "have"       "has"        "had"        "having"     "do"        
##  [51] "does"       "did"        "doing"      "would"      "should"    
##  [56] "could"      "ought"      "i'm"        "you're"     "he's"      
##  [61] "she's"      "it's"       "we're"      "they're"    "i've"      
##  [66] "you've"     "we've"      "they've"    "i'd"        "you'd"     
##  [71] "he'd"       "she'd"      "we'd"       "they'd"     "i'll"      
##  [76] "you'll"     "he'll"      "she'll"     "we'll"      "they'll"   
##  [81] "isn't"      "aren't"     "wasn't"     "weren't"    "hasn't"    
##  [86] "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"    
##  [91] "won't"      "wouldn't"   "shan't"     "shouldn't"  "can't"     
##  [96] "cannot"     "couldn't"   "mustn't"    "let's"      "that's"    
## [101] "who's"      "what's"     "here's"     "there's"    "when's"    
## [106] "where's"    "why's"      "how's"      "a"          "an"        
## [111] "the"        "and"        "but"        "if"         "or"        
## [116] "because"    "as"         "until"      "while"      "of"        
## [121] "at"         "by"         "for"        "with"       "about"     
## [126] "against"    "between"    "into"       "through"    "during"    
## [131] "before"     "after"      "above"      "below"      "to"        
## [136] "from"       "up"         "down"       "in"         "out"       
## [141] "on"         "off"        "over"       "under"      "again"     
## [146] "further"    "then"       "once"       "here"       "there"     
## [151] "when"       "where"      "why"        "how"        "all"       
## [156] "any"        "both"       "each"       "few"        "more"      
## [161] "most"       "other"      "some"       "such"       "no"        
## [166] "nor"        "not"        "only"       "own"        "same"      
## [171] "so"         "than"       "too"        "very"
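
If you only want to know whether a handful of specific words are on the list, you don't have to scan the whole printout. The snippet below is a small sketch using base R's %in% operator; the object names check_words and on_list are made up for this example.

```r
library(tm)

# Words to look up; "said" and "upon" are words that turn up a lot in Oliver Twist
check_words <- c("the", "said", "upon")

# %in% returns TRUE for each word that appears on tm's English stoplist
on_list <- check_words %in% stopwords("english")
names(on_list) <- check_words
on_list
```

Going by the list printed above, only "the" should come back TRUE, which means "said" and "upon" will stay in your text unless you add them to a custom stoplist.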

Sometimes there are words you’ll want to keep; for instance, some quantitative analyses of qualitative data look at pronouns and how often they’re used, in which case the first 30 words in the stoplist aren’t ones you’d want to drop. You might also have your own stopwords or phrases, like ‘you know’. In that case, you can create a new object listing your desired stopwords, and then use it in place of the provided list:

mystopwords <- c("chapter","said")

# want to use part of R's list but don't want to retype or copy/paste a bunch of words? we can drop specified words from R's stoplist instead:

rstopwords <- stopwords("english")
dropme <- c("up","down","in","out","when","where","why","how") #words we want to keep in our text, so they need to come off the stoplist
mystopwords2 <- c(mystopwords, setdiff(rstopwords, dropme)) #setdiff() returns R's stoplist minus the words in dropme, and the c() command combines my previous list of stopwords with this new, edited one

docs <- tm_map(docs, removeWords, mystopwords2) #to use your custom stoplist
docs <- tm_map(docs, removeWords, stopwords("english")) #to use the stoplist provided by R
inspect(head(docs, n = 20))
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 20
## 
##  [1]                                                         
##  [2] treats   place  oliver twist  born                      
##  [3] circumstances attending  birth                          
##  [4]                                                         
##  [5]                                                         
##  [6] among  public buildings   certain town   many reasons   
##  [7]  will  prudent  refrain  mentioning     will            
##  [8] assign  fictitious name   one anciently common   towns  
##  [9] great  small  wit  workhouse    workhouse  born         
## [10]  day  date   need  trouble   repeat inasmuch            
## [11]  can    possible consequence   reader   stage           
## [12]  business   events  item  mortality whose name          
## [13] prefixed   head                                         
## [14]                                                         
## [15]   long time    ushered   world  sorrow                  
## [16] trouble   parish surgeon  remained  matter  considerable
## [17] doubt whether  child  survive  bear  name               
## [18] case   somewhat   probable   memoirs  never             
## [19]  appeared       comprised within  couple                
## [20] pages    possessed  inestimable merit

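Of course, you can also use both options if you want to remove the provided list and your own items: just run both lines of code. Keep in mind that the provided list will also remove any words you deliberately left off your custom list.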
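
The ‘you know’ example from the start of this section can be sketched on a single string. The interview text below is invented for illustration, and the object names are my own. removeWords() from tm works on plain character strings as well as on a corpus, and it accepts multi-word phrases.

```r
library(tm)

# Hypothetical interview snippet, already lower-cased and stripped of punctuation
interview <- "it was you know a hard semester but you know we learned a lot"

# removeWords() takes phrases as well as single words
no_filler <- removeWords(interview, "you know")

# Removing words leaves double spaces behind, so collapse them with base R's gsub()
no_filler <- gsub(" +", " ", no_filler)
no_filler
```

The same phrase can go straight into a custom stoplist, for example c("chapter", "said", "you know"), and be passed to tm_map() as shown above.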

Create Wordcloud

Finally, we’ve cleaned our text data of all the extra bits and bobs, and are ready to create the wordcloud. The code below transforms our corpus into a term-document matrix (a special type of object that’s used for text analysis in R). The other lines allow us to look at the word-level data before finally creating our cloud.

dtm <- TermDocumentMatrix(docs) #create term-document matrix
word_freq <- sort(rowSums(as.matrix(dtm)), decreasing=TRUE) #add up how often each word appears across the corpus, most frequent first
d <- data.frame(word = names(word_freq), freq = word_freq) # pulls from the term-document matrix so we can look at our word counts

head(d, 10) #view 10 most used words. you can view the 'd' dataframe for a list of all the words used, sorted by frequency. this can be a good way to spot any extra characters, like the fancy quotes “” that weren't stripped out by the tm_map() command, so you can go back, remove them, and re-run your code. I also noticed that 'said' was the most frequently used word, so I went back and added it to my stoplist
##                word freq
## oliver       oliver  747
## upon           upon  479
## replied     replied  464
## one             one  449
## old             old  444
## man             man  365
## bumble       bumble  363
## sikes         sikes  344
## time           time  319
## gentleman gentleman  308
set.seed(1) #R uses the seed when generating random content, in this case the arrangement of the wordcloud. if you don't like the shape of your cloud, try a new seed
wordcloud(words = d$word, freq = d$freq, min.freq = 5, scale=c(3,.25),
          max.words=150, random.order=T, random.color = F, rot.per=0.35, 
          colors=brewer.pal(8, "Spectral"))
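
If the term-document matrix step feels a bit opaque, it may help to see one built from a tiny corpus. This is a toy sketch: the two sentences are made up, the object names (toy, toy_tdm, toy_m, toy_counts) are my own, and it uses VCorpus() purely to keep the example self-contained.

```r
library(tm)

# Two made-up "documents", standing in for lines of the cleaned corpus
toy <- VCorpus(VectorSource(c("oliver asked for more",
                              "oliver wanted more gruel")))

# Rows are terms, columns are documents, cells are counts
toy_tdm <- TermDocumentMatrix(toy)
toy_m <- as.matrix(toy_tdm)
toy_m

# Summing each row gives the total count per word, which is what the wordcloud uses
toy_counts <- sort(rowSums(toy_m), decreasing = TRUE)
toy_counts
```

Each word gets one row, so rowSums() adds up how many times it appears across every document. That total is the 'freq' column in the 'd' dataframe.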

A few arguments within the wordcloud() command that you might want to tweak:

  • min.freq - The minimum number of times a word must be used to appear in the cloud.
  • scale - The first value is the size of the largest words in the cloud, the second value is the size of the smallest words. Setting this to c(1,1) will make all words the same size, regardless of how often they’re used.
  • max.words - What it says. Will drop from the bottom, so less frequent words will be cut off if you lower this number. You can look at the number of observations in the ‘d’ dataframe to see how many words are in your data (for Oliver Twist, there are 11,357 unique words)
  • random.order - Can be ‘true’ or ‘false’ (T or F). If true, words are placed in random order. If false, they’re plotted in decreasing order of frequency.
  • random.color - Same idea, but with colors. If true, colors are picked at random from your palette. If false, each word’s color is based on its frequency.
  • rot.per - The proportion of words that are rotated 90 degrees (0.35 means about 35% of the words). Rotated words can make the wordcloud prettier but can pose an accessibility issue, so if you want people to really focus on your cloud, setting this to 0 will make it a bit more readable.
  • colors - Determines your color palette. This code uses the palettes from the RColorBrewer package. You can preview the different palettes by running display.brewer.all(). The number determines how many levels are in the palette (minimum is 3), and the section in quotes selects the palette. You can also create your own palette by replacing the brewer.pal() command with your list of hex colors (like so: colors=c("#000000","#999999","#AAAAAA")).
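
To get a feel for these arguments, here is a sketch of a more readable variation on the cloud. It assumes you only want to plot the ten most frequent words from the head(d, 10) output shown earlier, so it runs without the full corpus. The names top_words and my_palette and the three hex colors are my own choices for illustration.

```r
library(wordcloud)
library(RColorBrewer)

# The ten most frequent words and their counts, copied from the head(d, 10) output
top_words <- data.frame(
  word = c("oliver", "upon", "replied", "one", "old",
           "man", "bumble", "sikes", "time", "gentleman"),
  freq = c(747, 479, 464, 449, 444, 365, 363, 344, 319, 308)
)

# A custom palette given as hex colors instead of a brewer.pal() call
my_palette <- c("#1B9E77", "#D95F02", "#7570B3")

set.seed(1)
wordcloud(words = top_words$word, freq = top_words$freq, min.freq = 1,
          scale = c(3, .5),
          random.order = FALSE,  # plot words from most to least frequent
          random.color = FALSE,  # color follows frequency rather than chance
          rot.per = 0,           # no rotated words, for readability
          colors = my_palette)
```

Try changing one argument at a time and re-running, so you can see what each one does to the cloud.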

Conclusion

You made it to the end!

I hope you’ve found this document useful and that the process of creating visualizations hasn’t been too arduous. These are some of the most basic and useful plots and visualizations, but there is a ton of stuff that can be done with the ggplot2 package, including interactive graphs and widgets. In the workshop, we’ll go through some more complex examples and discuss some of the issues in visualizing data. I hope to see you there!