Introduction

In this document, I’ll be going over a series of techniques for making data presentable and readable for an audience. After all, a major portion of data science is ensuring that your intended audience can understand and interpret the analysis without issue. Over the course of this document, we will cover the following:

  • Graphing data via GGPlot
  • Graphing data via Plotly
  • Detecting and removing outliers
  • Techniques for exploratory statistics
  • Text Mining
  • Usage of APIs for text analysis

GGPlot

Intro

GGPlot (ggplot2) is a very useful package that allows you to graph all kinds of data, with a wide range of methods and parameters for customizing the presentation. The sections below walk through a number of useful graphing techniques.

Barplots

We’ll start by making a dataframe based on R’s built-in ToothGrowth data set. This is a dummy dataset that will simply serve to illustrate the methods for a barplot. Note that we’re also loading ggplot2, the package we’ll be using, along with dplyr, a generally useful package for readable code and organization.

library(ggplot2)
library(dplyr)

df <- data.frame(dose = c('D0.5','D1','D2'),
                 len = c(4.2, 10, 29.5))

df
##   dose  len
## 1 D0.5  4.2
## 2   D1 10.0
## 3   D2 29.5

Next, let’s make a second dataframe. In this one, we’re including a third variable: supplement (supp). VC is vitamin C, and OJ is orange juice.

df2 <- data.frame(supp = rep(c('VC','OJ'), each = 3),
                  dose = rep(c('D0.5','D1','D2'), 2),
                  len = c(6.8,15,33,4.2,10,29.5))

df2
##   supp dose  len
## 1   VC D0.5  6.8
## 2   VC   D1 15.0
## 3   VC   D2 33.0
## 4   OJ D0.5  4.2
## 5   OJ   D1 10.0
## 6   OJ   D2 29.5

Here, we’ll set the appropriate parameters for ggplot’s visuals.

theme_set(
  theme_classic() +
    theme(legend.position = 'top')
)

We’ll start with some basic barplots using the tooth data we’ve set up. We’ll create a variable f, which will be our ggplot object, then add the geom_col() layer to plot it.

f <- ggplot(df, aes(x=dose,y=len))

f + geom_col()

Now let’s try changing the fill and adding labels to the top.

f + geom_col(fill='darkblue') + 
  geom_text(aes(label=len), vjust=-0.3)

Now let’s add the labels inside the bars.

f + geom_col(fill='darkblue') + 
  geom_text(aes(label=len),vjust=1.6, color='white')

Next, we’ll change the barplot colors by group.

f + geom_col(aes(color=dose), fill='white') +
  scale_color_manual(values=c('blue','gold','red'))

This isn’t very readable, so let’s adjust the fill color.

f + geom_col(aes(fill=dose)) +
  scale_fill_manual(values=c('blue','gold','red'))

Suppose we have a dataset like df2, which has two treatment groups. How do we graph that in a barplot? We can map supp to the color and fill aesthetics to distinguish the two groups, then use position=position_stack() to place them atop one another in a stacked barplot.

ggplot(df2, aes(x=dose,y=len)) +
  geom_col(aes(color=supp, fill=supp), position=position_stack()) +
  scale_color_manual(values=c('blue','gold')) +
  scale_fill_manual(values=c('blue','gold'))

What if we want the bars next to each other instead of stacked? Note that we can also save our graphs into a variable - for example, p - then call the variable to print the formatted plot.

p <- ggplot(df2, aes(x=dose,y=len)) + 
  geom_col(aes(color=supp,fill=supp), position=position_dodge(0.8), width=0.7) + 
  scale_color_manual(values=c('blue','gold')) + 
  scale_fill_manual(values=c('blue','gold'))

p

Let’s now add labels to the dodged barplot. Note that position_dodge has been slightly adjusted, moving the labels slightly off center.

p + geom_text(
  aes(label=len,group=supp),
  position=position_dodge(0.85),
  vjust=-0.3,size=3.5
)

Alternatively, what if we want labels for our stacked barplots? To do this, we’ll need dplyr. Below, we use dplyr to add another column that stores label position values. Note that the %>% symbol is the pipe operator (re-exported by dplyr), which passes the result on its left into the function on its right.
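As a quick aside (not part of the barplot workflow itself), here is a minimal illustration of the pipe: the two calls below are equivalent, since the pipe supplies the object on its left as the first argument of the function on its right.

# These two lines do the same thing
head(df2, 2)
df2 %>% head(2)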

df2 <- df2 %>%
  group_by(dose) %>%
  arrange(dose, desc(supp)) %>%
  mutate(lab_ypos=cumsum(len)-0.5*len)

We’ll now recreate our stacked graphs. We’ll also supply a vector of text colors so the labels on both the gold and blue bars are readable.

ggplot(df2, aes(x=dose,y=len)) +
  geom_col(aes(fill=supp), width=0.7) +
  geom_text(aes(y=lab_ypos,label=len,group=supp),color=c('black','white','black','white','black','white')) +
  scale_color_manual(values=c('blue','gold')) +
  scale_fill_manual(values=c('blue','gold'))

Boxplots

Let’s now look at boxplots. This time, we’re going to use the full ToothGrowth dataset built into R.

library(ggplot2)

data('ToothGrowth')

Let’s change the dose to a factor, then look at the top of the dataframe.

ToothGrowth$dose <- as.factor(ToothGrowth$dose)

head(ToothGrowth, 4)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5

Let’s set the theme for our plots.

theme_set(
  theme_dark() +
    theme(legend.position='top')
)

And now we’ll begin with a very basic boxplot of dose vs. length.

tg <- ggplot(ToothGrowth, aes(x=dose,y=len)) 
tg + geom_boxplot()

Next, we’ll add points for the means of the data and try out a notched boxplot.

tg + geom_boxplot(notch=TRUE, fill='lightgray') +
  stat_summary(fun=mean, geom='point', shape=18, size=2.5, color='indianred')

It’s also possible to adjust which levels are included on the axis, and their order. Below, we’ll drop the data for dose=1.

tg + geom_boxplot() +
  scale_x_discrete(limits=c('0.5','2'))
## Warning: Removed 20 rows containing missing values (stat_boxplot).

Next, we will adjust the scale to instead go in descending order.

tg + geom_boxplot() +
  scale_x_discrete(limits=c('2','1','0.5'))

Boxplot colors can also be adjusted by group.

tg + geom_boxplot(aes(color=dose)) +
  scale_color_manual(values=c('indianred','blue1','green2'))

What if we want to display our data subset by OJ vs. Vitamin C?

tg2 <- tg + geom_boxplot(aes(fill=supp), position=position_dodge(0.8)) +
  scale_fill_manual(values=c('#E69F00','#999999'))

tg2

If you want the two groups in separate panels rather than adjacent boxes, you can use facet_wrap. We’ll go over facets more in-depth in the Other Plotting Methods section below.

tg2 + facet_wrap(~supp)

Histograms

Before moving forward, we’re going to set a seed so that our randomly generated data is reproducible for demonstration purposes. We will be setting up a dataframe of randomly generated normal data, hence the seed.

library(dplyr)
library(ggplot2)
set.seed(1234)

wdata = data.frame(
  sex=factor(rep(c('F','M'), each=200)),
  weight=c(rnorm(200,56), rnorm(200,58))
)

head(wdata,4)
##   sex   weight
## 1   F 54.79293
## 2   F 56.27743
## 3   F 57.08444
## 4   F 53.65430

Next, we’ll need summary statistics in order to graph our data. We’ll use dplyr to calculate these.

mu <- wdata %>%
  group_by(sex) %>%
  summarise(grp.mean=mean(weight))

Now, we’ll set our theme for ggplot.

theme_set(
  theme_dark() +
    theme(legend.position='bottom')
)

Next, we’ll create a ggplot object using our dataframe.

a <- ggplot(wdata, aes(x=weight))

a + geom_histogram(bins=30,color='black',fill='grey') +
  geom_vline(aes(xintercept=mean(weight)),
             linetype='dashed', size=0.6)

Next, let’s change the colors so each group is separately identifiable.

a + geom_histogram(aes(color=sex), fill='white', position='identity') +
  scale_color_manual(values=c('#00AFBB','#B7B800'))

Now let’s add a fill to the bars.

a + geom_histogram(aes(color=sex, fill=sex), position='identity') +
  scale_color_manual(values=c('#00AFBB','#B7B800')) +
  scale_fill_manual(values=c('indianred','steelblue'))

What if we want to combine density plots and histograms? Combining plots with GGPlot is as easy as using a ‘+’ sign, actually.

a + geom_histogram(aes(y=stat(density)),
                   color='black', fill='white') +
  geom_density(alpha=0.2, fill='#FF6666')

We can also separate by the sex variable to plot two density curves along with our histogram.

a + geom_histogram(aes(y=stat(density), color=sex), fill='white', position='identity') +
  geom_density(aes(color=sex),size=1) +
  scale_color_manual(values=c('indianred','steelblue'))

Lineplots

Lineplots are one of the most ubiquitous and intuitive graph types when used in their intended way. To demonstrate how to make them, we’ll create a simple dataframe like we did for the barplots section:

library(ggplot2)

df <- data.frame(dose=c('D0.5','D1','D2'),
                 len=c(4.2,10,29.5))

df
##   dose  len
## 1 D0.5  4.2
## 2   D1 10.0
## 3   D2 29.5

Next we’ll create a second dataframe for plotting by groups.

df2 <- data.frame(supp=rep(c('VC','OJ'),each=3),
                  dose=rep(c('D0.5','D1','D2'), 2),
                  len=c(6.8,15,33,4.2,10,29.5))

df2
##   supp dose  len
## 1   VC D0.5  6.8
## 2   VC   D1 15.0
## 3   VC   D2 33.0
## 4   OJ D0.5  4.2
## 5   OJ   D1 10.0
## 6   OJ   D2 29.5

Now, as usual, we’ll set up GGPlot’s theme.

theme_set(
  theme_gray() +
    theme(legend.position='right')
)

Now let’s get to the actual line plots, starting with basic line plots. First we will build a function to display all the different line types.

generateRLineTypes <- function(){
  oldPar <- par()
  par(font=2, mar=c(0,0,0,0))
  plot(1, pch='',ylim=c(0,6),xlim=c(0,0.7), axes=FALSE, xlab='', ylab='')
  for(i in 0:6) lines(c(0.3,0.7), c(i,i), lty=i, lwd=3)
  text(rep(0.1,7), 0:6, labels=c("0.'blank'", "1.'solid'", "2.'dashed'", "3.'dotted'", "4.'dotdash'", "5.'longdash'", "6.'twodash'"))
  par(mar=oldPar$mar, font=oldPar$font)
}

generateRLineTypes()

Now we’ll build our basic line plot.

p <- ggplot(data=df, aes(x=dose, y=len, group=1))

p + geom_line() + geom_point()

Now let’s try modifying the line type and colors.

p + geom_line(linetype='dashed', color='steelblue') +
  geom_point(color='indianred')

Next we’ll try a step graph, which indicates a threshold-type progression.

p + geom_step() + geom_point()

Now let’s move on to making multiple groups, starting with creating a new ggplot object.

p2 <- ggplot(df2, aes(x=dose, y=len, group=supp))

Now we’ll change the line types and point shapes by group.

p2 + geom_line(aes(linetype=supp, color=supp)) + 
  geom_point(aes(shape=supp, color=supp)) +
  scale_color_manual(values=c('darkred','darkblue'))

What about line plots with a numeric x-axis? Let’s make a new dataframe where dose is stored as strings that we’ll convert to numeric data.

df3 <- data.frame(supp=rep(c('VC','OJ'), each=3),
                  dose=rep(c('0.5','1','2'), 2),
                  len=c(6.8,15,33,4.2,10,29.5))

df3
##   supp dose  len
## 1   VC  0.5  6.8
## 2   VC    1 15.0
## 3   VC    2 33.0
## 4   OJ  0.5  4.2
## 5   OJ    1 10.0
## 6   OJ    2 29.5

Now we’ll convert dose to numeric and plot so that both axes are treated as continuous.

df3$dose <- as.numeric(as.vector(df3$dose))
ggplot(data=df3, aes(x=dose, y=len, group=supp, color=supp)) +
  geom_line() + geom_point()

Next we’ll look at a line graph with dates on the x-axis. We’ll use the built-in economics time series for this example:

head(economics)
## # A tibble: 6 × 6
##   date         pce    pop psavert uempmed unemploy
##   <date>     <dbl>  <dbl>   <dbl>   <dbl>    <dbl>
## 1 1967-07-01  507. 198712    12.6     4.5     2944
## 2 1967-08-01  510. 198911    12.6     4.7     2945
## 3 1967-09-01  516. 199113    11.9     4.6     2958
## 4 1967-10-01  512. 199311    12.9     4.9     3143
## 5 1967-11-01  517. 199498    12.8     4.7     3066
## 6 1967-12-01  525. 199657    11.8     4.8     3018

And here’s the plot itself:

ggplot(data=economics, aes(x=date, y=pop)) +
  geom_line()

Let’s make some subsets for the data.

ss <- subset(economics, date > as.Date('2006-1-1'))
ggplot(data=ss, aes(x=date, y=pop)) + geom_line()

We can also change the line size contextually based on a secondary variable, such as unemployment.

ggplot(data=economics, aes(x=date, y=pop)) +
  geom_line(aes(size=unemploy/pop))

We can also plot multiple time-series data.

ggplot(economics, aes(x=date)) +
  geom_line(aes(y=psavert), color='darkred') +
  geom_line(aes(y=uempmed), color='steelblue', linetype='twodash')

Lastly, let’s make this into an area plot.

ggplot(economics, aes(x=date)) +
  geom_area(aes(y=psavert), fill='#999999',
            color = '#999999', alpha = 0.5) +
  geom_area(aes(y=uempmed), fill='#E69F00',
            color = '#E69F00', alpha = 0.5)

Dotplots

First, we’ll load up our required packages as usual and set our GGPlot theme.

library(ggplot2)

theme_set(
  theme_light() +
    theme(legend.position='bottom')
)

Now we’ll initiate a ggplot object called TG using the ToothGrowth dataset. Note that we begin by importing our data, then setting dose to be a factor.

data('ToothGrowth')
ToothGrowth$dose <- as.factor(ToothGrowth$dose)

tg <- ggplot(ToothGrowth, aes(x=dose,y=len))

We’ll now create a dotplot with a summary statistic.

tg + geom_dotplot(binaxis='y', stackdir='center', fill='lightgray') +
  stat_summary(fun=mean, fun.args=list(mult=1))

Let’s plot both a dotplot and boxplot together.

tg + geom_boxplot(width=0.5) +
  geom_dotplot(binaxis='y', stackdir='center',fill='lightgray')

Let’s now add a violin plot to our dotplot, which helps visualize the shape of the distribution.

tg + geom_violin(trim=FALSE) +
  geom_dotplot(binaxis='y', stackdir='center', fill='#999999') +
  stat_summary(fun=mean, fun.args=list(mult=1))

Now let’s try creating a dotplot with multiple groups specified by color.

tg + geom_boxplot(width=0.5) +
  geom_dotplot(aes(fill=supp),binaxis='y', stackdir='center') +
  scale_fill_manual(values=c('indianred','lightblue1'))

And finally, just as we could split the bar graphs into separate bars by variable, we can also split our data into two separate histogram/dotplots.

tg + geom_boxplot(aes(color=supp), width=0.5, position=position_dodge(0.8)) +
  geom_dotplot(aes(fill=supp, color=supp), binaxis='y', stackdir='center', dotsize=0.8, position=position_dodge(0.8)) +
  scale_fill_manual(values=c('orange','green')) +
  scale_color_manual(values=c('orange','green'))

Density Plots

Density plots serve as good alternatives to histograms. First, we’ll establish our dataset from random normal numbers placed into a dataframe.

library(dplyr)
library(ggplot2)
set.seed(376)

wdata = data.frame(
  sex=factor(rep(c('F','M'), each=200)),
  weight=c(rnorm(200,55),rnorm(200,58))
)

Next, we’ll generate summary statistics for plotting.

mu <- wdata %>%
  group_by(sex) %>%
  summarise(grp.mean=mean(weight))

Now let’s set our theme for ggplot.

theme_set(
  theme_classic() +
    theme(legend.position='left')
)

Next we’ll do the basic plotting function, including creating a ggplot object.

d <- ggplot(wdata, aes(x=weight))

d + geom_density() +
  geom_vline(aes(xintercept=mean(weight)), linetype='dashed')

Now we’ll change the y axis to counts instead of density.

d + geom_density(aes(y=stat(count)), fill = 'steelblue', alpha=0.5) +
  geom_vline(aes(xintercept=mean(weight)), linetype='dashed')

We can also split the single line between our sex variable.

d + geom_density(aes(color=sex)) +
  scale_color_manual(values=c('red','blue'))

Now let’s add a fill on the density plots, as well as lines to mark the means of both groups.

d + geom_density(aes(fill=sex), alpha = 0.5) +
  geom_vline(aes(xintercept=grp.mean, color=sex), data = mu, linetype='dashed') +
  scale_color_manual(values=c('red','blue')) +
  scale_fill_manual(values=c('red','blue'))

Ridge Plots

First, we’ll load our required packages below. Note that we’re loading specifically ggridges for this, as well as viridis. Viridis expands our options for colors and interfaces well with ggplot and ggridges.

library(ggplot2)
library(ggridges)
library(viridis)
library(tidyr)

Next, let’s do a simple ridge plot for density using temperature by month from the airquality dataset built into R.

air <- ggplot(airquality) + aes(Temp, Month, group=Month) + geom_density_ridges()

air

Let’s add some pizazz now. Viridis will be used in the fill and scale_fill_viridis layers below.

ggplot(airquality) + aes(Temp, Month, group=Month, fill=..x..) +
  geom_density_ridges_gradient() +
  scale_fill_viridis(option='C', name='Temp')

Viridis’ color gradients and scale give us much more readable data. Ridge plotting may not be the best for this data, but it serves as a good template for the sake of example.

The last thing we’ll do is create a facet plot for our data.

airquality %>%
gather(key='Measurement', value='value', Ozone, Solar.R, Wind, Temp) %>%
  ggplot() + aes(value, Month, group=Month) + #Here, we're saying we want value & month as our axes, then grouping the data by month
  geom_density_ridges() +
  facet_wrap(~ Measurement, scales = 'free')

ECDF Plots

ECDF stands for Empirical Cumulative Distribution Function. An ECDF plot is a useful alternative to a histogram, as it displays the full range of a dataset without the need for extra parameters such as bin width.

We’ll start by computing an empirical cumulative distribution function: for any given value, the ECDF reports the proportion of observations at or below that value. First, we’ll create a dummy dataframe from randomized data:

set.seed(1234)

wdata=data.frame(
  sex=factor(rep(c('F','M'), each=200)),
  weight=c(rnorm(200,55), rnorm(200,58))
)

head(wdata, 5)
##   sex   weight
## 1   F 53.79293
## 2   F 55.27743
## 3   F 56.08444
## 4   F 52.65430
## 5   F 55.42912
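Before plotting, it can help to see the ECDF as a plain function. As a hedged aside using the wdata we just generated (the exact proportions depend on the random draw), base R’s ecdf() returns a step function that we can evaluate at any threshold.

Fn <- ecdf(wdata$weight) # returns a function

Fn(56)     # proportion of observations with weight at or below 56
1 - Fn(56) # proportion of observations above 56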

Next, we’ll load our plotting package and set up our theme.

library(ggplot2)

theme_set(
  theme_gray() +
    theme(legend.position='bottom')
)

Now we’ll create our ECDF plot.

ggplot(wdata,
       aes(x=weight)) +
  stat_ecdf(aes(color=sex, linetype=sex),
            geom='step', size=1.5) +
  scale_color_manual(values=c('magenta','forestgreen')) +
  labs(y='weight')

QQ Plots

And now we’ll take a look at QQ Plots. These are used to determine if given data follows a normal distribution.

First, we’ll load the required packages and set our seed, then randomly generate our dummy data.

library(ggpubr)
library(ggplot2)
library(mnonr)

set.seed(1234)

wdata=data.frame(
  sex=factor(rep(c('F','M'), each=200)),
  weight=c(rnorm(200,55), rnorm(200,58))
)

head(wdata, 5)
##   sex   weight
## 1   F 53.79293
## 2   F 55.27743
## 3   F 56.08444
## 4   F 52.65430
## 5   F 55.42912
theme_set(
  theme_dark() +
    theme(legend.position='top')
)

Now we’ll create the QQ plot itself.

ggplot(wdata,
       aes(sample=weight)) +
  stat_qq(aes(color=sex)) +
  scale_color_manual(values=c('#0073C2FF','#FC4E07')) +
  labs(y='weight')

We can also add confidence bands to make the trend in the data more easily apparent.

ggqqplot(wdata, x='weight',
         color='sex',
         palette=c('#0073C2FF','#FC4E07'),
         ggtheme=theme_pubclean())

What would non-normal data look like on a QQ plot? To demonstrate this, we’ll use the mnonr package to create some random data.

data2 <- mnonr::mnonr(n=1000, p=2, ms=3, mk=61, Sigma=matrix(c(1, 0.5, 0.5, 1), 2, 2), initial=NULL)

data2 <- as.data.frame(data2)

Next, we’ll plot that non-normal data.

ggplot(data2,
       aes(sample=V1)) +
  stat_qq()

And here’s that same data using the shading we demonstrated earlier.

ggqqplot(data2, x='V1',
         palette='forestgreen',
         ggtheme=theme_pubclean())

Plotly

Intro

GGPlot isn’t the only graphing package available in R, and Plotly has the unique ability to produce interactive plots. In addition, these interactive plots don’t even need to be 2D; the second section will demonstrate how to create an interactive 3D plot.

Line Plots

First let’s load our required package and use a built-in data set of orange tree ages and their circumference.

library(plotly)

Orange <- as.data.frame(Orange)

Next, let’s create a basic scatter plot of our data.

plot_ly(data = Orange, x=~age, y=~circumference)

If you mouse-over the plot above, you can get the exact values of each point. We can do more, though; let’s add some more info. We’ll have each color represent a different tree ID, and scale each point by age. Additionally, each point can be hovered over for detailed information.

plot_ly(data = Orange, x=~age, y=~circumference,
        color = ~Tree, size = ~age,
        text = ~paste('Tree ID:', Tree, '<br>Age:', age, 'Circ:', circumference)
)

Next, let’s create a random distribution and add it to our dataframe.

trace_1 <- rnorm(35, mean=120, sd=10)
new_data <- data.frame(Orange, trace_1)

We’ll use the random numbers as lines on the graph to demonstrate how Plotly can be used to display multiple different variables and graph types.

plot_ly(data=new_data, x=~age, y=~circumference, color=~Tree, size=~age, text=~paste('Tree ID:', Tree, '<br>Age:', age, 'Circ:', circumference)) %>%
  add_trace(y=~trace_1, mode='lines') %>%
  add_trace(y=~circumference, mode='markers')

Now let’s create a graph with the option of showing as a scatter or line, and add labels.

plot_ly(data=new_data, x=~age, y=~circumference, color=~Tree, size=~circumference, text=~paste('Tree ID:', Tree, '<br>Age:', age, 'Circ:', circumference)) %>%
  add_trace(y=~trace_1, mode='lines') %>%
  add_trace(y=~circumference, mode='markers') %>%
  layout(
    title='Plot of Orange Data with Switchable Trace',
    updatemenus = list(
      list(
        type='dropdown',
        y=0.8,
        buttons=list(
          list(method='restyle',
               args=list('mode', 'markers'),
               label='Marker'
          ),
          list(method='restyle',
               args=list('mode','lines'),
               label='Lines'
          )
        )
      )
    )
  )

Plotly 3D

First, we’ll load our required packages and create a random 3D matrix.

library(plotly)

d <- data.frame(
  x = seq(1,10,by=0.5),
  y = seq(1,10,by=0.5)
)

z <- matrix(rnorm(length(d$x)*length(d$y)), nrow=length(d$x), ncol=length(d$y))

Now let’s plot our 3D data.

plot_ly(d, x=~x, y=~y, z=~z) %>%
  add_surface()

You can click and drag the above plot to spin the 3D plot, allowing for excellent interactivity with the user. Next, let’s add some topographical lines to our plot to add yet more information.

plot_ly(d, x=~x, y=~y, z=~z) %>%
  add_surface(contours=list(
    z=list(show=TRUE, usecolormap=TRUE,
           highlightcolor='indianred',
           project=list(z=TRUE))
  ))

Now let’s try out a 3D scatter plot using Longley’s Economic Regression Data.

plot_ly(longley, x=~GNP, y=~Population, z=~Employed, marker=list(color=~GNP)) %>%
  add_markers()

Other Plotting Methods

Intro

Aside from GGPlot and Plotly, there are other useful packages for niche applications, as well as a number of general practices that apply to almost any graph. Error bars are important for presenting an array of plots and datasets appropriately, facet plots can separate data for clarity, and heatmaps are frequently found in genetics research. Each of these is outlined in the sections below.

Error Bars

As always, we’ll begin with loading our required libraries and setting up our dataset. In this case, we’ll use the built-in ToothGrowth dataset.

library(ggplot2)
library(dplyr)
library(plotrix)

theme_set(
  theme_dark() +
    theme(legend.position = 'bottom')
)

df <- ToothGrowth
df$dose <- as.factor(df$dose)

Now we’ll use dplyr to summarize our data, computing the standard deviation, standard error, and mean length for each dose.

df.summary <- df %>%
  group_by(dose) %>%
  summarise(
    sd = sd(len, na.rm=TRUE),
    stderr = std.error(len, na.rm=TRUE),
    len = mean(len)
  )

df.summary
## # A tibble: 3 × 4
##   dose     sd stderr   len
##   <fct> <dbl>  <dbl> <dbl>
## 1 0.5    4.50  1.01   10.6
## 2 1      4.42  0.987  19.7
## 3 2      3.77  0.844  26.1

Now let’s look at some key functions:

  1. geom_crossbar() - makes hollow bars with the middle indicated by a horizontal line
  2. geom_errorbar() - makes vertical error bars
  3. geom_errorbarh() - makes horizontal error bars
  4. geom_linerange() - draws a vertical line that represents an interval (for example, drawing a significance line for genetic analysis)
  5. geom_pointrange() - draws a vertical line representing an interval with a point in the middle

We’ll demonstrate these, but first we must create our ggplot object.

tg <- ggplot(
  df.summary,
  aes(x=dose,y=len,ymin=len-sd, ymax=len+sd)
)

First, let’s test out basic error bars.

tg + geom_pointrange()

tg + geom_errorbar(width=0.2) +
  geom_point(size=1.5)
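For completeness, the two remaining geoms from the list above can be sketched with the same tg object, since it already carries the ymin and ymax aesthetics. This is just an illustrative aside, not part of the main walkthrough.

tg + geom_crossbar()  # hollow bars with a horizontal line at the mean
tg + geom_linerange() # bare vertical interval lines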

Now we’ll create some horizontal error bars by manipulating our graph.

ggplot(df.summary,
       aes(x=len, y=dose, xmin=len-sd, xmax=len+sd)) +
  geom_point() +
  geom_errorbarh(height=0.2)

This gives us a good idea of how error bars can be placed along the horizontal axis.

Next, we’ll look at adding jitter points (actual measurements) to our data.

ggplot(df,
       aes(dose,len)) +
  geom_jitter(position=position_jitter(0.2), color='white') +
  geom_pointrange(aes(ymin=len-sd, ymax=len+sd), data=df.summary)

Next, let’s try placing error bars on a violin plot.

ggplot(df,
       aes(dose,len)) +
  geom_violin(color='white', trim=FALSE) +
  geom_pointrange(aes(ymin=len-sd, ymax=len+sd), data=df.summary) +
  geom_errorbar(aes(ymin=len-sd, ymax=len+sd), data=df.summary, width=0.2)

How about line graphs? Error bars can be added to those like so:

ggplot(df.summary,
       aes(dose,len)) +
  geom_line(aes(group=1)) + #always specify this when you have 1 line
  geom_errorbar(aes(ymin=len-stderr, ymax=len+stderr), width=0.2) +
  geom_point(size=2)

Now we’ll make a bar graph with half error bars.

ggplot(df.summary,
       aes(dose,len)) +
  geom_col(fill='white', color='black') +
  geom_errorbar(aes(ymin=len, ymax=len+stderr), width=0.2)

Alternatively, with full error bars;

ggplot(df.summary,
       aes(dose,len)) +
  geom_col(fill='white', color='black') +
  geom_errorbar(aes(ymin=len-stderr, ymax=len+stderr), width=0.2)

As you can see, the main difference is how we specified ymin and ymax in geom_errorbar; by setting ymin=len rather than len-stderr, the lower half of each error bar stops at the top of its bar, so only the upper half is visible.

Let’s next try adding jitter points to some line plots. To do so, we’ll need to use the original dataframe for the jitter plot, and the summary df for the geom layers.

ggplot(df,
       aes(dose,len)) +
  geom_jitter(position=position_jitter(0.2), color='white') +
  geom_line(aes(group=1), data=df.summary) +
  geom_errorbar(aes(ymin=len-stderr, ymax=len+stderr), data=df.summary, width=0.2) +
  geom_point(data=df.summary, size=2)

We can even add jitter points to a barplot.

ggplot(df,
       aes(dose,len)) +
  geom_col(data=df.summary, fill='steelblue', color='black') +
  geom_jitter(position=position_jitter(0.2), color='white') +
  geom_errorbar(aes(ymin=len-stderr, ymax=len+stderr),
                data=df.summary, width=0.2) +
  geom_point(data=df.summary, size=2)

What if we wanted separate error bars for different groups of our data? (OJ vs. VC) We’ll start by making a new summary object that groups our data by dose and supplement.

df.summary2 <- df %>%
  group_by(dose, supp) %>%
  summarise(
    sd=sd(len),
    stderr=std.error(len),
    len=mean(len)
  )
## `summarise()` has grouped output by 'dose'. You can override using the
## `.groups` argument.
df.summary2
## # A tibble: 6 × 5
## # Groups:   dose [3]
##   dose  supp     sd stderr   len
##   <fct> <fct> <dbl>  <dbl> <dbl>
## 1 0.5   OJ     4.46  1.41  13.2 
## 2 0.5   VC     2.75  0.869  7.98
## 3 1     OJ     3.91  1.24  22.7 
## 4 1     VC     2.52  0.795 16.8 
## 5 2     OJ     2.66  0.840 26.1 
## 6 2     VC     4.80  1.52  26.1

Next, we’ll plot our data.

ggplot(df.summary2,
       aes(dose,len)) +
  geom_pointrange(
    aes(ymin=len-stderr, ymax=len+stderr, color=supp), position=position_dodge(0.3)) +
  scale_color_manual(values=c('indianred','lightblue')) +
    geom_errorbar(
    aes(ymin=len-stderr, ymax=len+stderr, color=supp, group=supp),
    width=0.2, position=position_dodge(0.3))

Let’s try a line plot with multiple error bars.

ggplot(df.summary2,
       aes(dose,len)) +
  geom_line(aes(linetype=supp, group=supp)) +
  geom_point() +
  geom_errorbar(aes(ymin=len-stderr, ymax=len+stderr, group=supp), width=0.2)

And the same with a bar plot:

ggplot(df.summary2,
       aes(dose,len)) +
  geom_col(aes(fill=supp), position=position_dodge(0.8), width=0.7) +
  geom_errorbar(
    aes(ymin=len-stderr, ymax=len+stderr, group=supp),
    width=0.2, position=position_dodge(0.8)) +
  scale_fill_manual(values=c('indianred','lightblue'))

Next, let’s try adding some jitter points to a line graph.

ggplot(df,
       aes(dose, len, color=supp)) +
  geom_jitter(position=position_dodge(0.2)) +
  geom_line(aes(group=supp), data=df.summary2) +
  geom_point() +
  geom_errorbar(aes(ymin=len-stderr, ymax=len+stderr, group=supp), data=df.summary2, width=0.2)

And finally, we can also add jitter points to a bar graph to further illustrate the shape of our data.

ggplot(df,
       aes(dose, len, color=supp)) +
  geom_col(data=df.summary2, position=position_dodge(0.8), width=0.7, fill='black') +
  geom_jitter(position=position_jitterdodge(jitter.width=0.2, dodge.width=0.8)) +
  geom_errorbar(aes(ymin=len-stderr, ymax=len+stderr), data=df.summary2, width=0.2, position=position_dodge(0.8)) +
  scale_color_manual(values=c('gold','white')) +
  theme(legend.position='top')

Facet Plots

Facet plots can be used for a number of different purposes. We’ll start by splitting a single dataset into panels, then show how to combine plots built from separate datasets into one figure.

library(ggpubr)
library(ggplot2)

theme_set(
  theme_dark() +
    theme(legend.position='top')
)

First, we’ll make a nice boxplot.

df <- ToothGrowth
df$dose <- as.factor(df$dose)

p <- ggplot(df,
            aes(x=dose,y=len)) +
  geom_boxplot(aes(fill=supp), position=position_dodge(0.9)) +
  scale_fill_manual(values=c('forestgreen','blue'))

p

Now let’s take a look at ggplot’s facet_grid() function.

p +
  facet_grid(rows=vars(supp))

Next, let’s use facets on data with multiple variables.

p +
  facet_grid(rows=vars(dose), cols=vars(supp))

Let’s take a quick look at the facet_wrap function, which can place facets side-by-side.

p +
  facet_wrap(vars(dose), ncol=2)

Now how would we go about combining multiple plots? We can use ggarrange(). We’ll start by making some basic plots.

my3cols <- c('forestgreen','blue','magenta') #This defines our color palette
ToothGrowth$dose <- as.factor(ToothGrowth$dose)

p2 <- ggplot(ToothGrowth,
             aes(x=dose, y=len))
bxp <- p2 + 
  geom_boxplot(aes(color=dose)) +
  scale_color_manual(values=my3cols)

Now we’ll do a dotplot.

dp <- p2 + geom_dotplot(aes(color=dose, fill=dose),
                        binaxis='y', stackdir='center') +
  scale_color_manual(values=my3cols) +
  scale_fill_manual(values=my3cols)

And lastly, we’ll create a line plot.

lp <- ggplot(economics, aes(x=date, y=psavert)) + #we're mix-and-matching datasets just for visual comparison's sake
  geom_line(color='indianred')

Finally, we can put the whole figure together.

figure <- ggarrange(bxp, dp, lp, labels=c('A','B','C'), ncol=2, nrow=2)

figure

This is good, but it can look better. Let’s rearrange so that our line plot is placed on top and stretched the full width of our figure.

figure2 <- ggarrange(
  lp,
  ggarrange(bxp, dp, ncol=2, labels=c('B','C')),
  nrow=2,
  labels='A')

figure2

This looks pretty good, but B and C have redundant legends. Let’s fix that by having them share a single legend.

figure3 <- ggarrange(
  lp,
  ggarrange(bxp, dp, ncol=2, labels=c('B','C'), common.legend=TRUE, legend='bottom'),
  nrow=2,
  labels='A')

figure3

Finally, let’s export our figure to a PDF.

ggexport(figure3, filename='facetfigure.pdf')

You can also export multiple plots to a pdf.

ggexport(bxp, dp, lp, filename='multi.pdf')

Lastly, we can export to pdf with multiple pages and multiple columns.

ggexport(bxp, dp, lp, p, filename='test2.pdf', nrow=2, ncol=1)

Heatmaps

Let’s get started with heatmaps. We’ll load our package, then get our data together. Our dataset is a time series, so we’ll have to reshape it into a matrix and then re-add our labels.

library(heatmap3)

data <- ldeaths

data2 <- do.call(cbind, split(data, cycle(data))) #This transforms the data into a dataframe-like structure, but we lose our labels
dimnames(data2) <- dimnames(.preformat.ts(data)) #This adds the labels back to our data

Now we’ll generate a basic heatmap.

heatmap(data2)

We can also remove the clustering from the previous version, organizing the rows and columns in label order.

heatmap(data2, Rowv=NA, Colv=NA)

Next, we’ll mess around with the colors. We’ll use the rainbow function to generate a palette.

rc <- rainbow(nrow(data2), start=0, end=0.3)
cc <- rainbow(ncol(data2), start=0, end=0.3)

Now we’ll apply our color selection to the heatmap. ColSideColors adds a strip of colors above the columns, annotating each column (here, each month) with our generated palette.

heatmap(data2, ColSideColors=cc)

We can also use the package RColorBrewer to access more options for our heatmap’s colors.

library(RColorBrewer)
heatmap(data2, ColSideColors=cc,
        col = colorRampPalette(brewer.pal(8, 'PiYG'))(25))

There are more aspects that we can customize. The heatmap.2 function adds a color key in the top left with a small histogram showing how many cells fall in each color bin. It comes from the gplots package, not to be confused with ggplot2.

library(gplots)

heatmap.2(data2, ColSideColors=cc,
          col=colorRampPalette(brewer.pal(8, 'PiYG'))(25))

Outliers

Intro

Outliers can be very harmful to any data analysis, and for more reasons than simply skewing results. In R and many of its packages, a completely missing value can cause errors, or even make functions fail entirely. Addressing missing values, finding outliers, and analyzing covariance is important in any solid data analysis.

Missing Values

If you encounter unusual values in your data and want to move on to the rest of your analysis, you have two options:

  • Drop the entire row with the strange values
  • Replace unusual values with missing values (N/A)

Below, we’ll try the first method and filter the diamonds dataset, keeping only rows where y is between 3 and 20. In diamonds2.5, I also added filters on x and z to remove similar outliers.

library(dplyr)
library(ggplot2)

diamonds <- diamonds

diamonds2 <- diamonds %>%
  filter(between(y,3,20))

diamonds2.5 <- diamonds %>%
  filter(between(y,3,20), between(x,3,11), between(z,2,6))

This way of filtering data isn’t always preferable as one bad measurement doesn’t necessarily mean we should toss the entire row; other measurements still may be useful.

Instead, you can use the second method: replacing unusual values with missing values (N/A)

Below, we’ll mutate the diamonds dataset to change that set of y values we filtered (below 3 and above 20) into NA values, or missing values. diamonds3.5 does the same with the x and z values.

diamonds3 <- diamonds %>%
  mutate(y = ifelse(y < 3 | y > 20, NA, y))

diamonds3.5 <- diamonds %>%
  mutate(y = ifelse(y < 3 | y > 20, NA, y), x = ifelse(x < 3 | x > 11, NA, x), z = ifelse(z < 2 | z > 6, NA, z))

Like base R, ggplot2 doesn’t let missing values silently disappear: they are excluded from the plot, but you’re warned that this has happened.
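As a quick hedged aside on the base R side of that statement, most summary functions propagate NA unless you explicitly remove missing values. Using the diamonds3 object we just created:

mean(diamonds3$y)               # returns NA because the column now contains missing values
mean(diamonds3$y, na.rm = TRUE) # drops the NAs and averages the remaining values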

ggplot(data = diamonds3, mapping = aes(x=x, y=y)) +
         geom_point()
## Warning: Removed 9 rows containing missing values (geom_point).

As you can see, ggplot alerts you that it has removed rows containing missing values.

To suppress that warning, you can use na.rm = TRUE, as demonstrated below.

ggplot(data = diamonds3, mapping= aes(x=x,y=y)) +
  geom_point(na.rm = TRUE)

Sometimes, we want to understand what makes observations with missing values different from recorded values; perhaps recorded values could allow you to infer why data is missing.

For example, in the nycflights13 dataset, missing values in the departure time column (dep_time) indicate a cancelled flight.

You may want to compare scheduled departure times for cancelled and non-cancelled flights:

library(nycflights13)

nycflights13::flights %>%
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = sched_dep_time %% 100,
    sched_dep_time = sched_hour + sched_min / 60
  ) %>%
  ggplot(mapping=aes(sched_dep_time)) +
  geom_freqpoly(mapping=aes(color=cancelled), binwidth=1/4)

Identifying Outliers

Let’s say we want to find outliers from a dataset. This section will focus on that, using a dataset on air quality that I stored locally.

We’ll start out like always, loading our required libraries and dataset:

library(outliers)
library(ggplot2)
library(readxl)

Air_data <- read_excel('~/School/Fall 2022/Bioinf/R Files/AirQualityUCI.xlsx')

Let’s create a function that uses the Grubbs test to identify all outliers. The Grubbs test identifies outliers in a univariate dataset that is presumed to come from a normal distribution. Below, I’ve commented in-line with the code, as each section is worth explaining.

grubbs.flag <- function(x) {
  #We'll start by creating a variable called outliers and save nothing in it; we'll add to it as we identify outliers
  outliers <- NULL
  #We'll also create a variable called test to identify which univariate we're testing
  test <- x
  #Now, using the outliers package, we'll use grubbs.test to find outliers in our variable
  grubbs.result <- grubbs.test(test)
  #Let's get the p-values of all tested variables
  pv <- grubbs.result$p.value
  #Now we'll keep testing while the most extreme value has a p-value below 0.05
  while(pv<0.05) {outliers <- c(outliers, as.numeric(strsplit(grubbs.result$alternative, ' ')[[1]][3]))
  #Any value with a p-value below 0.05 is flagged and added to our outliers vector
  test <- x[!x %in% outliers]
  #This keeps everything in our data that hasn't been flagged as an outlier
  #Next, we'll run the grubbs test again without the outliers
  grubbs.result <- grubbs.test(test)
  #Then finally save the new p-values from the test
  pv <- grubbs.result$p.value
  }
  return(data.frame(x=x, outliers = (x %in% outliers)))
}

Now let’s run our function.

identified_outliers <- grubbs.flag(Air_data$AH)

Now we can create a histogram showing those outliers.

ggplot(grubbs.flag(Air_data$AH), aes(x=x, color=outliers, fill=outliers))+
  geom_histogram(binwidth=diff(range(Air_data$AH))/30)+
  theme_bw()

Covariance

Covariance is the tendency of two variables to change together, and detecting it can provide valuable insight into the trends in many datasets. We’ll start with categorical data, specifically the diamonds dataset, to help illustrate this. The first plot we’ll use is GGPlot’s frequency polygon:

library(ggplot2)

ggplot(data = diamonds, mapping=aes(x=price)) +
  geom_freqpoly(mapping=aes(color=cut), binwidth=500)

Though this does a decent job of illustrating the relative counts of each cut, it’s hard to compare the distributions because the overall counts differ so much.

Let’s look at the counts of cut grade alone, without value:

ggplot(diamonds)+
  geom_bar(mapping=aes(x=cut))

To make the comparison easier, we’ll swap the display on the y-axis. Instead of displaying count, we’ll display density: the same counts but standardized so the area under the curve is one.

ggplot(data=diamonds, mapping=aes(x=price,y=..density..))+
  geom_freqpoly(mapping=aes(color=cut), binwidth=500)

It would appear that fair diamonds (the lowest cut quality) have the highest average price. Frequency polygons can be a little hard to interpret, though, so let’s double-check with another plot type.

An alternative is the boxplot. A boxplot is a type of visual shorthand for a distribution of values.

ggplot(data=diamonds, mapping=aes(x=cut, y=price))+
  geom_boxplot()

So here we see much less information about each distribution, but the boxplots are more compact and can be more easily compared with each other (clearer medians and spread).

This supports the counterintuitive finding that better quality diamonds are cheaper on average.

Let’s look at some car data next. We’ll plot car class against their highway miles per gallon.

ggplot(data = mpg, mapping=aes(x=class, y=hwy))+
  geom_boxplot()

It’s a bit messy, so let’s order them by ascending average highway mpg

ggplot(data=mpg)+
  geom_boxplot(mapping=aes(x=reorder(class, hwy, FUN=median), y=hwy))

If you have long variable names, you can switch the axis and flip it 90 degrees.

ggplot(data=mpg)+
  geom_boxplot(mapping=aes(x=reorder(class,hwy, FUN=median), y=hwy))+
  coord_flip()

To visualize the correlation between two continuous variables, we can use a scatter plot.

ggplot(data=diamonds)+
  geom_point(mapping=aes(x=carat,y=price))

Scatterplots become less useful as the size of your dataset grows because of something called overplotting. We can mitigate this using the alpha aesthetic.

ggplot(data=diamonds)+
  geom_point(mapping=aes(x=carat, y=price), alpha=5/100)

Now, we can see that the points sort of ‘shade’ locations with large numbers of points, so we can see concentrations more clearly than a flat scatter graph.

Exploratory Statistics

Intro

Data can often come to you in a form that requires exploration and experimentation to find the best ways to draw conclusions, and then to display those conclusions. In this section, we’ll walk through a number of general techniques one might use when exploring data.

Exploring Data in General

First, we’ll load required libraries and set up some data. RCurl allows us to load data from remote sites, and our data is being sourced from nytimes’ covid-19 college data.

library(RCurl)
library(dplyr)

site <- "https://raw.githubusercontent.com/nytimes/covid-19-data/master/colleges/colleges.csv"

College_Data <- read.csv(site)

Next, we’ll use the str function to show the structure of our data.

str(College_Data)
## 'data.frame':    1948 obs. of  9 variables:
##  $ date      : chr  "2021-05-26" "2021-05-26" "2021-05-26" "2021-05-26" ...
##  $ state     : chr  "Alabama" "Alabama" "Alabama" "Alabama" ...
##  $ county    : chr  "Madison" "Montgomery" "Limestone" "Lee" ...
##  $ city      : chr  "Huntsville" "Montgomery" "Athens" "Auburn" ...
##  $ ipeds_id  : chr  "100654" "100724" "100812" "100858" ...
##  $ college   : chr  "Alabama A&M University" "Alabama State University" "Athens State University" "Auburn University" ...
##  $ cases     : int  41 2 45 2742 220 4 263 137 49 76 ...
##  $ cases_2021: int  NA NA 10 567 80 NA 49 53 10 35 ...
##  $ notes     : chr  "" "" "" "" ...

What if we wanted to arrange our dataset alphabetically by college?

alphabetical <- College_Data %>%
  arrange(college)

The glimpse function (from dplyr) also gives us another way to preview our data.

glimpse(College_Data)
## Rows: 1,948
## Columns: 9
## $ date       <chr> "2021-05-26", "2021-05-26", "2021-05-26", "2021-05-26", "20…
## $ state      <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Ala…
## $ county     <chr> "Madison", "Montgomery", "Limestone", "Lee", "Montgomery", …
## $ city       <chr> "Huntsville", "Montgomery", "Athens", "Auburn", "Montgomery…
## $ ipeds_id   <chr> "100654", "100724", "100812", "100858", "100830", "102429",…
## $ college    <chr> "Alabama A&M University", "Alabama State University", "Athe…
## $ cases      <int> 41, 2, 45, 2742, 220, 4, 263, 137, 49, 76, 67, 0, 229, 19, …
## $ cases_2021 <int> NA, NA, 10, 567, 80, NA, 49, 53, 10, 35, 5, NA, 10, NA, 19,…
## $ notes      <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…

We can subset our data with select()

College_Cases <- select(College_Data, college, cases)

We can also filter or subset using the filter() function.

Louisiana_Cases <- filter(College_Data, state == 'Louisiana')

Let’s also filter for a small number of southern states.

South_Cases <- filter(College_Data, state == 'Louisiana' | state == 'Texas' | state == 'Arkansas' | state == 'Mississippi')

Next, let’s look at some time-series data, starting with loading the packages and data:

library(lubridate)
library(dplyr)
library(ggplot2)
library(gridExtra)
library(scales)

state_site <- 'https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv'

State_Data <- read.csv(state_site)

Let’s make a group_by object using the state column.

State_Cases <- group_by(State_Data, state)

class(State_Cases)
## [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"

How many measurements were made by state? This gives us an idea of when states started reporting.

Days_since_first_report <- tally(State_Cases)

Let’s next do some visualization of our data, but we’ll start with definitions (a short annotated example follows this list):

  • Data - the data we want to visualize
  • Layer - a component made of geometric elements (geoms) and the statistical transformations they require; the geometric objects are what actually represent the data on the plot
  • Scales - map values in the data space to aesthetic values such as color, size, shape, and position
  • Coordinate system - describes how data points are mapped onto the plane of the graphic relative to one another
  • Faceting - breaks the data into subsets displayed as separate panels or groups
  • Theme - controls the finer points of the display, such as font size and background color
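To connect these definitions to actual code, here is a small annotated sketch using the South_Cases subset created earlier; the axis limits and theme are arbitrary choices for illustration.

ggplot(data = South_Cases,                 # Data: the dataframe we want to visualize
       aes(x = cases, y = cases_2021)) +   # mapping of columns to aesthetics
  geom_point(alpha = 0.5) +                # Layer: points are the geometric objects
  scale_x_continuous(labels = comma) +     # Scale: controls how the x values are displayed
  coord_cartesian(xlim = c(0, 5000)) +     # Coordinate system: zoom in without dropping rows
  facet_wrap(~ state) +                    # Faceting: one panel per state
  theme_minimal()                          # Theme: background, fonts, and other display details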

Let’s look at a simple way to get a clear overview of a dataframe:

options(repr.plot.width=6, repr.plot.height=6)

class(College_Data)
## [1] "data.frame"
head(College_Data)
##         date   state     county       city ipeds_id
## 1 2021-05-26 Alabama    Madison Huntsville   100654
## 2 2021-05-26 Alabama Montgomery Montgomery   100724
## 3 2021-05-26 Alabama  Limestone     Athens   100812
## 4 2021-05-26 Alabama        Lee     Auburn   100858
## 5 2021-05-26 Alabama Montgomery Montgomery   100830
## 6 2021-05-26 Alabama     Walker     Jasper   102429
##                           college cases cases_2021 notes
## 1          Alabama A&M University    41         NA      
## 2        Alabama State University     2         NA      
## 3         Athens State University    45         10      
## 4               Auburn University  2742        567      
## 5 Auburn University at Montgomery   220         80      
## 6  Bevill State Community College     4         NA
summary(College_Data)
##      date              state              county              city          
##  Length:1948        Length:1948        Length:1948        Length:1948       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##    ipeds_id           college              cases          cases_2021    
##  Length:1948        Length:1948        Min.   :   0.0   Min.   :   0.0  
##  Class :character   Class :character   1st Qu.:  32.0   1st Qu.:  23.0  
##  Mode  :character   Mode  :character   Median : 114.5   Median :  65.0  
##                                        Mean   : 363.5   Mean   : 168.1  
##                                        3rd Qu.: 303.0   3rd Qu.: 159.0  
##                                        Max.   :9914.0   Max.   :3158.0  
##                                                         NA's   :337     
##     notes          
##  Length:1948       
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 

Let’s look at a different dataset in the same way:

iris <- as.data.frame(iris)

class(iris)
## [1] "data.frame"
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

Let’s create a scatter plot of the College Data.

ggplot(data=College_Data, aes(x=cases, y=cases_2021))+
  geom_point()+
  theme_minimal()

Now let’s look at the iris dataset.

ggplot(data=iris, aes(x=Sepal.Width, y=Sepal.Length))+
  geom_point()+
  theme_minimal()

Let’s color-code our college data by state

ggplot(data=College_Data, aes(x=cases, y=cases_2021, color=state))+
  geom_point()+
  theme_minimal()

And let’s do the same for the iris data

ggplot(data=iris, aes(x=Sepal.Width, y=Sepal.Length, color=Species))+
  geom_point()+
  theme_minimal()

We’ll next create a simple histogram using our Louisiana case data:

hist(Louisiana_Cases$cases, freq=NULL, density=NULL, breaks=10, xlab='Total Cases', ylab='Frequency', main='Total College Covid-19 Infections (Louisiana)')

And the same for the iris data:

hist(iris$Sepal.Width, freq=NULL, density=NULL, breaks=10, xlab='Sepal Width', ylab='Frequency', main='Iris Sepal Width')

Next let’s get fancy. We’ll set up the histogram to color-code based on the county/parish of Louisiana, stacking them atop one another in a fashion similar to a stacked bar graph.

histogram_college <- ggplot(data=Louisiana_Cases, aes(x=cases))

histogram_college + geom_histogram(binwidth=100, color='black', aes(fill=county))+
  xlab('cases') + ylab('Frequency') + ggtitle('Histogram of Covid-19 Cases in Louisiana')

And here we do a similar thing to the iris data, just using species as our delimiting factor rather than county/parish.

histogram_iris <- ggplot(data=iris, aes(x=Sepal.Width))

histogram_iris + geom_histogram(binwidth=0.2, color='black', aes(fill=Species)) +
  xlab('Sepal Width') + ylab('Frequency') + ggtitle('Histogram of Iris Sepal Widths by Species')

Maybe a density plot makes more sense for our college data

ggplot(South_Cases)+
  geom_density(aes(x=cases, fill=state), alpha=0.35)

Again with the iris data:

ggplot(iris) +
  geom_density(aes(x=Sepal.Width, fill=Species), alpha=0.35)

Let’s instead try setting up some violin plots for sepal length by species.

ggplot(data=iris, aes(x=Species, y=Sepal.Length, color=Species)) +
  geom_violin() +
  theme_dark() +
  theme(legend.position='none')

Let’s do the same with some of our covid data:

ggplot(data=South_Cases, aes(x=state, y=cases, color=state))+
  geom_violin()+
  theme_dark()+
  theme(legend.position='none')

Now let’s take a look at residual plots. A residual plot displays the residuals on the vertical axis and the fitted values (or the independent variable) on the horizontal axis. If the points are dispersed randomly around the horizontal axis, a linear regression is appropriate; if they show a clear pattern, a nonlinear model is more appropriate.

We’ll start demonstrating this using the iris data

ggplot(lm(Sepal.Length~Sepal.Width, data=iris))+
  geom_point(aes(x=.fitted, y=.resid))

As you can see, the residuals for the iris data appear quite random, so this relationship can reasonably be modeled with a linear model.

Let’s look at southern states’ covid cases instead:

ggplot(lm(cases~cases_2021, data=South_Cases))+
  geom_point(aes(x=.fitted, y=.resid))

In this case, there’s a very clear pattern in the residuals, indicating that a nonlinear model would be more appropriate.

Now let’s look at some correlations. We’ll load a new dataset about obesity and insurance rates, as well as our required packages.

library(tidyr)
library(plyr)
library(dplyr)
obesity <- read.csv('~/School/Fall 2022/Bioinf/R Files/Exploratory Data Analysis/Obesity_insurance.csv')

# Let's also check its class and structure
class(obesity)
## [1] "data.frame"
str(obesity)
## 'data.frame':    1338 obs. of  7 variables:
##  $ age     : int  19 18 28 33 32 31 46 37 37 60 ...
##  $ sex     : chr  "female" "male" "male" "male" ...
##  $ bmi     : num  27.9 33.8 33 22.7 28.9 ...
##  $ children: int  0 1 3 0 0 0 1 3 2 0 ...
##  $ smoker  : chr  "yes" "no" "no" "no" ...
##  $ region  : chr  "southwest" "southeast" "southeast" "northwest" ...
##  $ charges : num  16885 1726 4449 21984 3867 ...

Let’s also get a summary of distribution of our variables.

summary(obesity)
##       age            sex                 bmi           children    
##  Min.   :18.00   Length:1338        Min.   :15.96   Min.   :0.000  
##  1st Qu.:27.00   Class :character   1st Qu.:26.30   1st Qu.:0.000  
##  Median :39.00   Mode  :character   Median :30.40   Median :1.000  
##  Mean   :39.21                      Mean   :30.66   Mean   :1.095  
##  3rd Qu.:51.00                      3rd Qu.:34.69   3rd Qu.:2.000  
##  Max.   :64.00                      Max.   :53.13   Max.   :5.000  
##     smoker             region             charges     
##  Length:1338        Length:1338        Min.   : 1122  
##  Class :character   Class :character   1st Qu.: 4740  
##  Mode  :character   Mode  :character   Median : 9382  
##                                        Mean   :13270  
##                                        3rd Qu.:16640  
##                                        Max.   :63770

Now let’s look at the distribution for insurance charges.

hist(obesity$charges)

We can also look at it as a boxplot.

boxplot(obesity$charges)

Now let’s actually start looking for correlations in our data. The cor() command determines correlations between two vectors, all of the columns in a data frame, or two data frames. The cov() command examines covariance instead. The cor.test() command tests the significance of a correlation.

cor(obesity$charges, obesity$bmi)
## [1] 0.198341

By default, cor() uses the Pearson correlation; you can instead request Spearman’s rho or Kendall’s tau by specifying the method argument:

cor(obesity$charges, obesity$bmi, method='kendall')
## [1] 0.08252397

Correlation coefficients range from -1 to 1, so a value close to 0 indicates very little correlation between the variables.
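The other two commands mentioned above can be sketched on the same pair of columns. Output is omitted here; cov() returns the raw covariance in the combined units of the two variables, and cor.test() reports the Pearson estimate along with a p-value and confidence interval.

# Covariance between charges and BMI
cov(obesity$charges, obesity$bmi)

# Significance test of the correlation (Pearson by default)
cor.test(obesity$charges, obesity$bmi)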

Let’s next look at the Tietjen-Moore test, which detects a specified number of outliers in a univariate dataset that is assumed to be approximately normally distributed.

TietjenMoore <- function(dataSeries, k)
{
  n = length(dataSeries)
  #Compute the absolute residuals
  r = abs(dataSeries - mean(dataSeries))
  #Sort data according to size of residual
  df = data.frame(dataSeries, r)
  dfs = df[order(df$r),]
  #Create a subset of the data without the largest values
  klarge = c((n-k+1):n)
  subdataSeries = dfs$dataSeries[-klarge]
  #Compute the sum of squares
  ksub = (subdataSeries - mean(subdataSeries))**2
  all = (df$dataSeries - mean(df$dataSeries))**2
  #Compute the test statistic
  sum(ksub)/sum(all)
}

This function computes the absolute residuals, sorts the data by the size of those residuals, and then computes the ratio of sums of squares that serves as the test statistic.

FindOutliersTietjenMooreTest <- function(dataSeries, k, alpha=0.5){
  ek <- TietjenMoore(dataSeries, k)
  #Compute critical values based on simulation
  test = c(1:10000)
  for (i in 1:10000){
    dataSeriesdataSeries=rnorm(length(dataSeries))
    test[i]=TietjenMoore(dataSeriesdataSeries, k)}
  Talpha=quantile(test, alpha)
  list(T=ek, Talpha=Talpha)
}

This function computes the critical value from simulated data. Now let’s demonstrate both functions on the obesity dataset.

The critical region for the Tietjen-Moore test is determined by simulation. The simulation is performed by generating a standard normal random sample of size n and computing the Tietjen-Moore test statistic; typically, 10,000 random samples are used. The value of the statistic obtained from the data is then compared to this reference distribution. The test statistic lies between zero and one: if there are no outliers in the data, it is close to 1, and as more outliers appear, it drifts closer to 0. Thus, the test is always a lower, one-tailed test regardless of which test statistic (Lk or Ek) is used.

First, we’ll look at charges. Note: the number on our second command is the expected number of outliers.

boxplot(obesity$charges)

FindOutliersTietjenMooreTest(obesity$charges, 50)
## $T
## [1] 0.6751565
## 
## $Talpha
##       50% 
## 0.7731745

Let’s check out bmi next.

boxplot(obesity$bmi)

FindOutliersTietjenMooreTest(obesity$bmi, 7)
## $T
## [1] 0.9475628
## 
## $Talpha
##       50% 
## 0.9505247
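Because this is a lower-tailed test, a simple way to read these results is to compare T against Talpha: a test statistic below the simulated critical value is evidence of outliers. A minimal sketch of that comparison (not part of the original output; the critical value will vary slightly between runs because it is re-simulated):

result <- FindOutliersTietjenMooreTest(obesity$bmi, 7)

result$T < result$Talpha #TRUE suggests the presence of outliers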

Probability Plots

As per usual, we’ll begin by loading our necessary packages. Note that a new package to this section is tigerstats.

library(ggplot2)
library(tigerstats)

We will use the probability functions of the normal distribution, starting with dnorm(), the density function. Using the density, it is possible to determine the probability of events.

For example, you may wonder “What is the likelihood that a person has a BMI of exactly X?” In this case, you would need to retrieve the density of the BMI distribution at value X.

Assuming the BMI distribution is approximately normal, we can model it using the sample mean and standard deviation and find the corresponding density. Below, we input our parameters and generate our density, then plot a scatter plot of bmi by density.

bmi.mean <- mean(obesity$bmi)
bmi.sd <- sd(obesity$bmi)

bmi.dist <- dnorm(obesity$bmi, mean=bmi.mean, sd=bmi.sd)
bmi.df <- data.frame('bmi'=obesity$bmi, 'Density'=bmi.dist)

ggplot(bmi.df, aes(x=bmi, y=Density))+
  geom_point()

This gives us the density at each individual point. Next, let’s use the pnorm function, which instead returns the cumulative probability (the probability of observing a value at or below each point).

bmi.dist <- pnorm(obesity$bmi, mean=bmi.mean, sd=bmi.sd)
bmi.df <- data.frame('bmi'=obesity$bmi, 'Density'=bmi.dist)

ggplot(bmi.df, aes(x=bmi, y=Density))+
  geom_point()

What if we want to find the probability of the bmi being greater than 40 in our distribution?

pp_greater <- function(x){
  paste(round(100*pnorm(x, mean=30.66339, sd=6.09818, lower.tail=FALSE), 2), '%')
}

pp_greater(40)
## [1] "6.29 %"

To graphically represent everything included with this function (BMI over 40), we can use pnormGC. (Note that the standard deviation entered in the next two pnormGC calls, 9.0818, differs from the 6.09818 used elsewhere, which is why the shaded probabilities do not match the percentages above.)

pnormGC(40, region='above', mean=30.66339, sd=9.0818, graph=TRUE)

## [1] 0.1519615

What about the probability that a bmi is less than 40 in our distribution?

pp_less <- function(x){
  paste(round(100*(1-pnorm(x, mean=30.66339, sd=6.09818, lower.tail=FALSE)), 2), '%')
}

pp_less(40)
## [1] "93.71 %"
pnormGC(40, region='below', mean=30.66339, sd=9.0818, graph=TRUE)

## [1] 0.8480385

What if we want to find the area in between 20 and 40?

pnormGC(c(20,40), region='between', mean=30.66339, sd=6.09818, graph=TRUE)

## [1] 0.8969428

What if we want to know the quantiles? Let’s use the qnorm function. We must assume a normal distribution for this.

What bmi represents the lowest 1% of the population?

qnorm(0.01, mean=30.66339, sd=6.09818, lower.tail=TRUE)
## [1] 16.4769

What if you want a random sampling of values within your distribution?

subset <- rnorm(50, mean=30.66339, sd=6.09818)

hist(subset)

Let’s do it again with more samples from our distribution. Remember: the larger the sample, the more closely it will resemble the underlying normal distribution.

subset <- rnorm(5000, mean=30.66339, sd=6.09818)

hist(subset)

Shapiro-Wilk Tests

So we know how to generate a normal distribution. How do we tell if our samples come from a normal distribution in the first place?

shapiro.test(obesity$charges[1:5])
## 
##  Shapiro-Wilk normality test
## 
## data:  obesity$charges[1:5]
## W = 0.84164, p-value = 0.1695

As you can see, with a small sample size, we fail to reject the null hypothesis that the sample comes from a normal distribution. We can increase the power - the test’s ability to detect departures from normality - by increasing the sample size.

shapiro.test(obesity$charges[1:1000])
## 
##  Shapiro-Wilk normality test
## 
## data:  obesity$charges[1:1000]
## W = 0.8119, p-value < 2.2e-16

Next let’s look at age.

shapiro.test(obesity$age[1:1000])
## 
##  Shapiro-Wilk normality test
## 
## data:  obesity$age[1:1000]
## W = 0.94406, p-value < 2.2e-16

And finally, BMI.

shapiro.test(obesity$bmi[1:1000])
## 
##  Shapiro-Wilk normality test
## 
## data:  obesity$bmi[1:1000]
## W = 0.99471, p-value = 0.001426
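As a complementary visual check (not part of the original analysis), a normal Q-Q plot compares the sample quantiles against theoretical normal quantiles; points falling along the reference line suggest approximate normality.

qqnorm(obesity$bmi[1:1000], main='Normal Q-Q Plot of BMI')
qqline(obesity$bmi[1:1000]) #adds a reference line through the first and third quartiles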

Time Series Data

Let’s start this section by loading our packages and data:

library(readr)
library(readxl)
library(lubridate)
library(hms)

Air_data <- read_excel('~/School/Fall 2022/Bioinf/R Files/Exploratory Data Analysis/AirQualityUCI.xlsx')

What is all contained within our data?

  • Date = date of measurement
  • Time = time of measurement
  • CO(GT) = average hourly CO concentration
  • PT08.s1(CO) = tin oxide hourly average sensor response
  • NMHC(GT) = average hourly non-methane hydrocarbon concentration
  • C6H6(GT) = average hourly benzene concentration
  • PT08.s2(NMHC) = titania average hourly sensor response
  • NOX = average hourly NOX concentration
  • NO2 = average hourly NO2 concentration
  • T = temperature
  • RH = relative humidity
  • AH = absolute humidity

Let’s get a view on our data’s structure:

str(Air_data)
## tibble [9,357 × 15] (S3: tbl_df/tbl/data.frame)
##  $ Date         : POSIXct[1:9357], format: "2004-03-10" "2004-03-10" ...
##  $ Time         : POSIXct[1:9357], format: "1899-12-31 18:00:00" "1899-12-31 19:00:00" ...
##  $ CO(GT)       : num [1:9357] 2.6 2 2.2 2.2 1.6 1.2 1.2 1 0.9 0.6 ...
##  $ PT08.S1(CO)  : num [1:9357] 1360 1292 1402 1376 1272 ...
##  $ NMHC(GT)     : num [1:9357] 150 112 88 80 51 38 31 31 24 19 ...
##  $ C6H6(GT)     : num [1:9357] 11.88 9.4 9 9.23 6.52 ...
##  $ PT08.S2(NMHC): num [1:9357] 1046 955 939 948 836 ...
##  $ NOx(GT)      : num [1:9357] 166 103 131 172 131 89 62 62 45 -200 ...
##  $ PT08.S3(NOx) : num [1:9357] 1056 1174 1140 1092 1205 ...
##  $ NO2(GT)      : num [1:9357] 113 92 114 122 116 96 77 76 60 -200 ...
##  $ PT08.S4(NO2) : num [1:9357] 1692 1559 1554 1584 1490 ...
##  $ PT08.S5(O3)  : num [1:9357] 1268 972 1074 1203 1110 ...
##  $ T            : num [1:9357] 13.6 13.3 11.9 11 11.2 ...
##  $ RH           : num [1:9357] 48.9 47.7 54 60 59.6 ...
##  $ AH           : num [1:9357] 0.758 0.725 0.75 0.787 0.789 ...

Let’s tweak our time column by removing the odd redundant dates.

Air_data$Time <- as_hms(Air_data$Time)

glimpse(Air_data)
## Rows: 9,357
## Columns: 15
## $ Date            <dttm> 2004-03-10, 2004-03-10, 2004-03-10, 2004-03-10, 2004-…
## $ Time            <time> 18:00:00, 19:00:00, 20:00:00, 21:00:00, 22:00:00, 23:…
## $ `CO(GT)`        <dbl> 2.6, 2.0, 2.2, 2.2, 1.6, 1.2, 1.2, 1.0, 0.9, 0.6, -200…
## $ `PT08.S1(CO)`   <dbl> 1360.00, 1292.25, 1402.00, 1375.50, 1272.25, 1197.00, …
## $ `NMHC(GT)`      <dbl> 150, 112, 88, 80, 51, 38, 31, 31, 24, 19, 14, 8, 16, 2…
## $ `C6H6(GT)`      <dbl> 11.881723, 9.397165, 8.997817, 9.228796, 6.518224, 4.7…
## $ `PT08.S2(NMHC)` <dbl> 1045.50, 954.75, 939.25, 948.25, 835.50, 750.25, 689.5…
## $ `NOx(GT)`       <dbl> 166, 103, 131, 172, 131, 89, 62, 62, 45, -200, 21, 16,…
## $ `PT08.S3(NOx)`  <dbl> 1056.25, 1173.75, 1140.00, 1092.00, 1205.00, 1336.50, …
## $ `NO2(GT)`       <dbl> 113, 92, 114, 122, 116, 96, 77, 76, 60, -200, 34, 28, …
## $ `PT08.S4(NO2)`  <dbl> 1692.00, 1558.75, 1554.50, 1583.75, 1490.00, 1393.00, …
## $ `PT08.S5(O3)`   <dbl> 1267.50, 972.25, 1074.00, 1203.25, 1110.00, 949.25, 73…
## $ T               <dbl> 13.600, 13.300, 11.900, 11.000, 11.150, 11.175, 11.325…
## $ RH              <dbl> 48.875, 47.700, 53.975, 60.000, 59.575, 59.175, 56.775…
## $ AH              <dbl> 0.7577538, 0.7254874, 0.7502391, 0.7867125, 0.7887942,…

Now let’s do a simple plot of our humidity data.

plot(Air_data$AH, Air_data$RH, main='Humidity Analysis', xlab='Absolute Humidity', ylab='Relative Humidity')

Notice that we have extreme outliers in our data; in this dataset, missing measurements are coded as -200, which show up as extreme points. Let’s run a t-test.

t.test(Air_data$RH, Air_data$AH)
## 
##  Welch Two Sample t-test
## 
## data:  Air_data$RH and Air_data$AH
## t = 69.62, df = 17471, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  45.01707 47.62536
## sample estimates:
## mean of x mean of y 
## 39.483611 -6.837604

As you can see, the averages for our humidity data are significantly different; those -200 missing-value codes drag the mean absolute humidity below zero and wreak havoc with our plot.
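One possible cleanup step (a sketch, not part of the original analysis) is to recode the -200 flags as NA before plotting or testing, so they no longer distort the results; this assumes dplyr is loaded, as it was earlier.

Air_clean <- Air_data %>%
  mutate(across(where(is.numeric), ~ na_if(.x, -200))) #replace every -200 in the numeric columns with NA

plot(Air_clean$AH, Air_clean$RH, main='Humidity Analysis (cleaned)', xlab='Absolute Humidity', ylab='Relative Humidity')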

Text Mining

Intro

Text mining is the process of looking through sets of strings, often from social media or publications, and analyzing them on the basis of various parameters. Most frequently, text mining is used on social media to track sentiment towards companies, social issues, and political candidates or parties. It can also, however, be useful for tracking the frequency of certain words in published works, as well as for estimating the percentage of such publications that focus on certain subjects.

Basics

Let’s start with the unnest_tokens function. We’ll begin by looking at an Emily Dickinson passage.

text <- c("Because I could not stop from Death -",
          "He kindly stopped for me -",
          "The Carriage held but just Ourselves -",
          "and Immortality")

text
## [1] "Because I could not stop from Death -" 
## [2] "He kindly stopped for me -"            
## [3] "The Carriage held but just Ourselves -"
## [4] "and Immortality"

This is an example of a typical character vector that we might want to analyze. We’ll use something called tidytext to make it more easily manipulable. To turn this into a tidytext dataset, we’ll need to put it into a dataframe.

library(dplyr)

text_df <- tibble(line=1:4, text=text)

text_df
## # A tibble: 4 × 2
##    line text                                  
##   <int> <chr>                                 
## 1     1 Because I could not stop from Death - 
## 2     2 He kindly stopped for me -            
## 3     3 The Carriage held but just Ourselves -
## 4     4 and Immortality

Reminder: a tibble is a modern class of dataframe within R. It’s available in the dplyr and tibble packages, has a convenient print method, will not convert strings to factors, and does not use row names. Tibbles are great for use with tidy tools.
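For a quick illustration (not part of the original analysis), converting a base data frame shows the friendlier print method:

library(tibble)

as_tibble(mtcars) #prints only the first 10 rows and the columns that fit on screen; note the row names are dropped, since tibbles do not use them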

Now, we’ll use the unnest_tokens function.

First, we have to establish the output column name that will be created as the text is unnested into it. We’ll adjust our text so each word is separated into a word column, associated with a line #. This will be done using the tidytext package.

library(tidytext)

text_df %>%
  unnest_tokens(word, text)
## # A tibble: 20 × 2
##     line word       
##    <int> <chr>      
##  1     1 because    
##  2     1 i          
##  3     1 could      
##  4     1 not        
##  5     1 stop       
##  6     1 from       
##  7     1 death      
##  8     2 he         
##  9     2 kindly     
## 10     2 stopped    
## 11     2 for        
## 12     2 me         
## 13     3 the        
## 14     3 carriage   
## 15     3 held       
## 16     3 but        
## 17     3 just       
## 18     3 ourselves  
## 19     4 and        
## 20     4 immortality

Let’s use the janeaustenr package to analyze some Jane Austen texts. There are six books in this package.

The code below groups the text by book, then mutates the table so that the line number equals the row number, with the chapter number tracked in its own column.

library(janeaustenr)
library(stringr)

original_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber=row_number(),
         chapter=cumsum(str_detect(text, regex('^chapter [\\divxlc]',
                                               ignore_case=TRUE)))) %>%
  ungroup()

original_books
## # A tibble: 73,422 × 4
##    text                    book                linenumber chapter
##    <chr>                   <fct>                    <int>   <int>
##  1 "SENSE AND SENSIBILITY" Sense & Sensibility          1       0
##  2 ""                      Sense & Sensibility          2       0
##  3 "by Jane Austen"        Sense & Sensibility          3       0
##  4 ""                      Sense & Sensibility          4       0
##  5 "(1811)"                Sense & Sensibility          5       0
##  6 ""                      Sense & Sensibility          6       0
##  7 ""                      Sense & Sensibility          7       0
##  8 ""                      Sense & Sensibility          8       0
##  9 ""                      Sense & Sensibility          9       0
## 10 "CHAPTER 1"             Sense & Sensibility         10       1
## # … with 73,412 more rows

To work with this as a tidy dataset, we need to restructure it in the one-token-per-row format, which as we saw earlier is done with the unnest_tokens() function.

tidybooks <- original_books %>%
  unnest_tokens(word, text)

tidybooks
## # A tibble: 725,055 × 4
##    book                linenumber chapter word       
##    <fct>                    <int>   <int> <chr>      
##  1 Sense & Sensibility          1       0 sense      
##  2 Sense & Sensibility          1       0 and        
##  3 Sense & Sensibility          1       0 sensibility
##  4 Sense & Sensibility          3       0 by         
##  5 Sense & Sensibility          3       0 jane       
##  6 Sense & Sensibility          3       0 austen     
##  7 Sense & Sensibility          5       0 1811       
##  8 Sense & Sensibility         10       1 chapter    
##  9 Sense & Sensibility         10       1 1          
## 10 Sense & Sensibility         13       1 the        
## # … with 725,045 more rows

This function uses the tokenizers package to separate each line of text in the original dataframe into tokens. The default tokenizing is for words, but other options including characters, n-grams, sentences, lines, or paragraphs can be used.
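For example (a small sketch using the Dickinson lines from earlier), the same function can tokenize by bigram or by sentence instead of by word:

text_df %>%
  unnest_tokens(bigram, text, token='ngrams', n=2) #overlapping two-word tokens

text_df %>%
  unnest_tokens(sentence, text, token='sentences') #one token per sentence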

Now that the data is in a one-word-per-row format, we can manipulate it with tools like dplyr. Often in text analysis, we will want to remove stop words.

Stop words are words that are not useful for an analysis, such as ‘the’, ‘of’, ‘and’, and so on. We can remove stop words (kept in the tidytext dataset ‘stop_words’) with an anti_join().

data(stop_words)

tidier_books <- tidybooks %>%
  anti_join(stop_words)
## Joining, by = "word"

The stop_words dataset in the tidytext package contains stop words from three lexicons. We can use them all together as we have here, or use filter() to keep only one set of stop words if that’s more appropriate for your analysis. Naturally, you can also make your own dictionary dataset to filter specific words.
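For instance, a minimal sketch of restricting the anti-join to just the SMART lexicon (one of the three sources in stop_words) might look like this:

smart_stop_words <- stop_words %>%
  filter(lexicon == 'SMART') #keep only the SMART stop words

tidybooks %>%
  anti_join(smart_stop_words, by='word')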

Let’s compare word counts.

tidybooks %>%
  count(word, sort=TRUE)
## # A tibble: 14,520 × 2
##    word      n
##    <chr> <int>
##  1 the   26351
##  2 to    24044
##  3 and   22515
##  4 of    21178
##  5 a     13408
##  6 her   13055
##  7 i     12006
##  8 in    11217
##  9 was   11204
## 10 it    10234
## # … with 14,510 more rows
tidier_books %>%
  count(word, sort=TRUE)
## # A tibble: 13,914 × 2
##    word       n
##    <chr>  <int>
##  1 miss    1855
##  2 time    1337
##  3 fanny    862
##  4 dear     822
##  5 lady     817
##  6 sir      806
##  7 day      797
##  8 emma     787
##  9 sister   727
## 10 house    699
## # … with 13,904 more rows

Because we’ve been using tidy tools, our word counts are stored in a tidy data frame, allowing us to pipe this directly into ggplot2. For example, we can create a visualization of the most common words.

library(ggplot2)

tidier_books %>%
  count(word, sort=TRUE) %>%
  filter(n > 600) %>% #Include only words that appear more than 600 times
  mutate(word=reorder(word, n)) %>% #Reorder by word quantity
  ggplot(aes(n,word)) +
  geom_col() +
  labs(y=NULL, x='word count')

Let’s look at the gutenbergr package next.

This package provides access to public domain works from Project Gutenberg (www.gutenberg.org). It includes tools for downloading books as well as a complete dataset of Project Gutenberg metadata that can be used to find works of interest. We will mostly use the function gutenberg_download().
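For example, the bundled metadata can be searched before downloading anything; here is a sketch (the exact IDs returned depend on the current metadata):

library(gutenbergr)

gutenberg_works(author == 'Darwin, Charles') %>% #filter the metadata to works by Charles Darwin
  select(gutenberg_id, title)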

We’ll look more at word frequencies, starting with some biology texts from Darwin (Project Gutenberg IDs follow each title):

  • The Voyage of the Beagle - 944
  • On the Origin of Species by Means of Natural Selection - 1228
  • The Expression of the Emotions in Man and Animals - 1227
  • The Descent of Man, and Selection in Relation to Sex - 2300

library(gutenbergr)

darwin <- gutenberg_download(c(944, 1227, 1228, 2300), mirror = "https://mirror2.sandyriver.net/pub/gutenberg")

Let’s break these into tokens

tidy_darwin <- darwin %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
## Joining, by = "word"

Let’s see what the most common Darwin words are

tidy_darwin %>%
  count(word, sort=TRUE)
## # A tibble: 23,630 × 2
##    word          n
##    <chr>     <int>
##  1 species    2998
##  2 male       1672
##  3 males      1337
##  4 animals    1310
##  5 birds      1292
##  6 female     1197
##  7 sexes      1095
##  8 females    1038
##  9 selection  1038
## 10 sexual      801
## # … with 23,620 more rows

Now let’s get some works from Thomas Hunt Morgan, who won the Nobel Prize for establishing the role chromosomes play in heredity.

  • Regeneration - 57198
  • The genetic and operative evidence relating to secondary sexual characteristics - 57460
  • Evolution and Adaptation - 63540

morgan <- gutenberg_download(c(57198, 57460, 63540), mirror = 'https://mirror2.sandyriver.net/pub/gutenberg')

Next, we’ll tokenize the words

tidy_morgan <- morgan %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
## Joining, by = "word"

What are the most common words therein?

tidy_morgan %>%
  count(word, sort=TRUE)
## # A tibble: 13,855 × 2
##    word             n
##    <chr>        <int>
##  1 species        869
##  2 regeneration   814
##  3 piece          702
##  4 cut            669
##  5 male           668
##  6 forms          631
##  7 selection      604
##  8 cells          576
##  9 found          552
## 10 development    546
## # … with 13,845 more rows

Lastly, we’ll look at Thomas Henry Huxley.

  • Evidence as to Man’s Place in Nature - 2931
  • On the Reception of the Origin of Species - 2089
  • Evolution and Ethics, and Other Essays - 2940
  • Science and Culture, and Other Essays - 52344

huxley <- gutenberg_download(c(2931, 2089, 2940, 52344), mirror = 'https://mirror2.sandyriver.net/pub/gutenberg')

tidy_huxley <- huxley %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
## Joining, by = "word"

And let’s tally words.

tidy_huxley %>%
  count(word, sort=TRUE)
## # A tibble: 16,090 × 2
##    word          n
##    <chr>     <int>
##  1 species     339
##  2 nature      331
##  3 time        287
##  4 life        286
##  5 existence   255
##  6 knowledge   238
##  7 animals     227
##  8 natural     223
##  9 animal      216
## 10 science     207
## # … with 16,080 more rows

Now, let’s calculate the frequency for each word for the works of Darwin, Morgan, and Huxley by binding the frames together.

library(tidyr)

frequency <- bind_rows(mutate(tidy_morgan, author = 'Thomas Hunt Morgan'),
                       mutate(tidy_darwin, author = 'Charles Darwin'), 
                       mutate(tidy_huxley, author = 'Thomas Henry Huxley')) %>%
  mutate(word = str_extract(word, '[a-z]+')) %>% #keep only the first run of letters in each word (drops digits, punctuation, etc.)
  count(author, word) %>%
  group_by(author) %>%
  mutate(proportion = n/sum(n)) %>%
  select(-n) %>%
  pivot_wider(names_from = author, values_from = proportion) %>%
  pivot_longer('Thomas Hunt Morgan':'Charles Darwin', names_to = 'author', values_to = 'proportion')

frequency
## # A tibble: 95,082 × 3
##    word    author               proportion
##    <chr>   <chr>                     <dbl>
##  1 a       Thomas Hunt Morgan   0.00206   
##  2 a       Thomas Henry Huxley  0.0000856 
##  3 a       Charles Darwin       0.000141  
##  4 ab      Thomas Hunt Morgan   0.000165  
##  5 ab      Thomas Henry Huxley  0.0000978 
##  6 ab      Charles Darwin       0.00000642
##  7 abaiss  Thomas Hunt Morgan  NA         
##  8 abaiss  Thomas Henry Huxley NA         
##  9 abaiss  Charles Darwin       0.00000642
## 10 abandon Thomas Hunt Morgan   0.00000752
## # … with 95,072 more rows

Now, we need to change the table so that each author has its own column.

frequency2 <- pivot_wider(frequency, names_from='author', values_from='proportion')

frequency2
## # A tibble: 31,694 × 4
##    word        `Thomas Hunt Morgan` `Thomas Henry Huxley` `Charles Darwin`
##    <chr>                      <dbl>                 <dbl>            <dbl>
##  1 a                     0.00206                0.0000856       0.000141  
##  2 ab                    0.000165               0.0000978       0.00000642
##  3 abaiss               NA                     NA               0.00000642
##  4 abandon               0.00000752             0.0000122       0.00000321
##  5 abandoned             0.0000150              0.0000245       0.00000321
##  6 abashed              NA                     NA               0.00000321
##  7 abatement            NA                      0.0000245       0.00000321
##  8 abbot                NA                      0.0000245       0.00000321
##  9 abbott               NA                     NA               0.00000642
## 10 abbreviated          NA                     NA               0.0000128 
## # … with 31,684 more rows

Now we’ll plot the data.

library(scales)

ggplot(frequency2, aes(x=`Charles Darwin`, y=`Thomas Hunt Morgan`, color=abs(`Charles Darwin` - `Thomas Hunt Morgan`))) +
  geom_abline(color='gray40', lty=2) +
  geom_jitter(alpha=0.1, size=2.5, width=0.3, height=0.3) +
  geom_text(aes(label=word), check_overlap=TRUE, vjust=1.5) +
  scale_x_log10(labels=percent_format()) +
  scale_y_log10(labels=percent_format()) +
  scale_color_gradient(limits=c(0,0.001), 
                       low='indianred', high='steelblue') +
   theme(legend.position = 'none') +
  labs(y='Thomas Hunt Morgan', x='Charles Darwin')

ggplot(frequency2, aes(x = `Thomas Hunt Morgan`, y = `Thomas Henry Huxley`, color = abs(`Thomas Hunt Morgan` - `Thomas Henry Huxley`))) +
  geom_abline(color = 'grey40', lty = 2)+
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0,0.001),
                       low = 'darkslategray4', high = 'gray75') +
   theme(legend.position='none') +
  labs(y = 'Thomas Henry Huxley', x = 'Thomas Hunt Morgan')

Sentiment Analysis

There are a variety of methods and dictionaries that exist for evaluating the opinion or emotion in text.

Some examples:

  • AFINN: assigns a score between -5 and 5, with negative numbers indicating negative sentiment, and the number itself describing magnitude of the sentiment

  • bing: categorizes words in a binary fashion into positive or negative

  • nrc: categorizes words into positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust

The function get_sentiments() allows us to get the specific sentiments lexicon with the measures for each one. Let’s set up objects for our dictionaries/methods and take a look at them.

library(tidytext)
library(textdata)

afinn <- get_sentiments('afinn')

afinn
## # A tibble: 2,477 × 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # … with 2,467 more rows
bing <- get_sentiments('bing')

bing
## # A tibble: 6,786 × 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # … with 6,776 more rows
nrc <- get_sentiments('nrc')

nrc
## # A tibble: 13,872 × 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # … with 13,862 more rows

These lexicons were created either via crowdsourcing (using services such as Amazon Mechanical Turk) or by the direct labor of their authors, and were then validated with crowdsourcing or other external data.

Let’s look at the words with a joy score from some of Darwin’s works using nrc. We’ll start by setting up our wordbank from Darwin’s books.

library(gutenbergr)
library(dplyr)
library(stringr)

darwin <- gutenberg_download(c(944, 1227, 1228, 2300), mirror = 'https://mirror2.sandyriver.net/pub/gutenberg')

tidy_books <- darwin %>%
  group_by(gutenberg_id) %>%
  mutate(linenumber=row_number(), chapter=cumsum(str_detect(text, regex('^chapter [\\divxlc]', ignore_case=TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)

tidy_books
## # A tibble: 786,575 × 4
##    gutenberg_id linenumber chapter word   
##           <int>      <int>   <int> <chr>  
##  1          944          1       0 the    
##  2          944          1       0 voyage 
##  3          944          1       0 of     
##  4          944          1       0 the    
##  5          944          1       0 beagle 
##  6          944          1       0 by     
##  7          944          2       0 charles
##  8          944          2       0 darwin 
##  9          944          8       0 about  
## 10          944          8       0 the    
## # … with 786,565 more rows

Let’s add the book name instead of just using the gutenberg ID.

colnames(tidy_books)[1] <- 'book' #first, rename the gutenberg_id column to book

tidy_books$book[tidy_books$book == 944] <- "The Voyage of the Beagle" #set book value of 944 to equal our string
tidy_books$book[tidy_books$book == 1227] <- "The Expression of the Emotions in Man and Animals"
tidy_books$book[tidy_books$book == 1228] <- "On the Origin of Species By Means of Natural Selection"
tidy_books$book[tidy_books$book == 2300] <- "The Descent of Man, and Selection in Relation to Sex"

Now that we have a tidy format with one word per row, we are ready for sentiment analysis.

nrc_joy <- get_sentiments('nrc') %>% #filter out only joy terms from nrc and store in a new object
  filter(sentiment == 'joy')

tidy_books %>%
  filter(book == 'The Voyage of the Beagle') %>% #only use The Voyage of the Beagle
  inner_join(nrc_joy) %>% #only pull out words that match the joy list
  count(word, sort = TRUE) #count those words
## # A tibble: 277 × 2
##    word           n
##    <chr>      <int>
##  1 found        301
##  2 good         161
##  3 remarkable   114
##  4 green         95
##  5 kind          92
##  6 tree          86
##  7 present       85
##  8 food          78
##  9 beautiful     61
## 10 elevation     60
## # … with 267 more rows

We can also examine how sentiment changes throughout a work. The same approach can be applied over time, for example to tweets about something like COVID.

library(tidyr)

Charles_Darwin_sentiment <- tidy_books %>%
  inner_join(get_sentiments('bing')) %>%
  count(book, index=linenumber %/% 80, sentiment) %>% #for each book, bin every 80 lines into a single index. Note the use of %/%, which is integer division
  pivot_wider(names_from=sentiment, values_from=n, values_fill=0) %>%
  mutate(sentiment=positive-negative)
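Since the %/% operator may be unfamiliar: it performs integer division, which is what collapses every 80 line numbers into one index. A quick illustration (not part of the original analysis):

c(1, 79, 80, 159, 160) %/% 80 #returns 0 0 1 1 2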

Now let’s plot it to see how sentiment changes throughout the book.

library(ggplot2)

ggplot(Charles_Darwin_sentiment, aes(index, sentiment, fill=book)) +
  geom_col(show.legend=FALSE) +
  facet_wrap(~book, ncol=2, scales='free_x')

Note of course that this cannot accurately represent context, sarcasm, and such. For example, “not bad” may be recorded as a negative sentiment despite being a positive sentiment in typical speech. This analysis is very preliminary and rudimentary.

Let’s now compare the three sentiment dictionaries. Each one can serve a different purpose, so it’s important to select one that gives you the information that is most usable to you. Below we’ll go through all the steps for setting up all 3 datasets for each lexicon.

voyage <- tidy_books %>%
  filter(book == 'The Voyage of the Beagle')
afinn_voyage <- voyage %>%
  inner_join(get_sentiments('afinn')) %>%
  group_by(index=linenumber %/% 80) %>% #Again here, we use %/% to say "index every 80 linenumbers into groups"
  summarise(sentiment=sum(value)) %>%
  mutate(method='AFINN')
bing_nrc <- bind_rows(voyage %>%
                        inner_join(get_sentiments('bing')) %>%
                        mutate(method='Bing et al.'),
                      voyage %>%
                        inner_join(get_sentiments('nrc') %>%
                                     filter(sentiment %in% c('positive', 'negative'))) %>% #Recall that nrc has many emotion tags; here, we're filtering for only positive/negative for use with bing
                        mutate(method='nrc')) %>%
  count(method, index=linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from=sentiment, values_from=n, values_fill=0) %>% #Use names from sentiment for table, use n (our count) to get values, and if there's no value, input 0
  mutate(sentiment=positive-negative)

We can now estimate the net sentiment (positive - negative) in each chunk of the novel text for each lexicon. Let’s bind it all together and visualize with ggplot.

bind_rows(afinn_voyage, bing_nrc) %>%
  ggplot(aes(index, sentiment, fill=method)) +
  geom_col(show.legend=TRUE) +
  facet_wrap(~method, ncol=1, scales='free_y')

Let’s look at the counts based on each dictionary

get_sentiments('nrc') %>%
  filter(sentiment %in% c('positive', 'negative')) %>%
  count(sentiment)
## # A tibble: 2 × 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   3316
## 2 positive   2308
get_sentiments('bing') %>%
  count(sentiment)
## # A tibble: 2 × 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   4781
## 2 positive   2005

We can also sort out the most common words while tagged with their associated sentiments.

bing_word_counts <- tidy_books %>%
  inner_join(get_sentiments('bing')) %>%
  count(word, sentiment, sort=TRUE) %>%
  ungroup()

bing_word_counts
## # A tibble: 2,492 × 3
##    word       sentiment     n
##    <chr>      <chr>     <int>
##  1 great      positive   1226
##  2 well       positive    855
##  3 like       positive    813
##  4 good       positive    487
##  5 doubt      negative    414
##  6 wild       negative    317
##  7 respect    positive    310
##  8 remarkable positive    295
##  9 important  positive    281
## 10 bright     positive    258
## # … with 2,482 more rows

This can be shown visually via piping straight into ggplot2

bing_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n=10) %>% #Take the top 10 words by count within each sentiment group
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill=sentiment)) +
  geom_col(show.legend=FALSE) +
  facet_wrap(~sentiment, scale='free_y') +
  labs(x='Contribution to Sentiment', y=NULL)

Let’s see if we can spot any anomalies in the dataset. We’ll flag a few words that are potentially being misinterpreted by our lexicon; wild, dark, like, and great might be used in more neutral contexts than would indicate emotion.

custom_stop_words <- bind_rows(tibble(word = c('wild', 'dark', 'great', 'like'), lexicon = c('custom')), stop_words)

custom_stop_words
## # A tibble: 1,153 × 2
##    word      lexicon
##    <chr>     <chr>  
##  1 wild      custom 
##  2 dark      custom 
##  3 great     custom 
##  4 like      custom 
##  5 a         SMART  
##  6 a's       SMART  
##  7 able      SMART  
##  8 about     SMART  
##  9 above     SMART  
## 10 according SMART  
## # … with 1,143 more rows

You could use the above custom_stop_words lexicon to filter out all of those words plus standard stop words to check if texts are still as positive/negative with those potential misinterpretations removed.
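A minimal sketch of that idea (not part of the original analysis) would be to repeat the earlier sentiment word count with the custom list applied:

tidy_books %>%
  anti_join(custom_stop_words, by='word') %>% #drop the standard stop words plus our custom terms
  inner_join(get_sentiments('bing')) %>%
  count(word, sentiment, sort=TRUE)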

Next, let’s look at Word Clouds.

We can see that tidy text mining and sentiment analysis work well with ggplot2, but having our data in tidy format also lends itself to other nice graphical techniques.

library(wordcloud)

tidy_books %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words=100))

Let’s also look at comparison.cloud(), which requires turning the data frame into a matrix. We can do so using the acast() function.

library(reshape2)

tidy_books %>%
  inner_join(get_sentiments('bing')) %>%
  count(word, sentiment, sort=TRUE) %>%
  acast(word ~ sentiment, value.var='n', fill=0) %>%
  comparison.cloud(colors=c('steelblue', 'forestgreen'), max.words = 100)

As you can see, comparison.cloud() gave us a new layer of information by adding colors to define which words are positive and which words are negative. This can not only show us the most common words, but also compare volume of positive to negative words used.

Looking at Units Beyond Words

Lots of useful work can be done by tokenizing at the word level, but sometimes it’s nice to look at different units of text. For example, we can look beyond just unigrams.

Example: I am not having a good day.

Given the presence of “not” and “good”, this could be marked as a neutral sentence/line, since “not” would count as negative and “good” would count as positive. Other processing may be better at finding the appropriate sentiment. For now, let’s move to a larger unit of text and ask which chapter of each book is the most negative, using the ratio of negative (bing) words to total words per chapter.

bingnegative <- get_sentiments('bing') %>%
  filter(sentiment == 'negative')

wordcounts <- tidy_books %>%
  group_by(book, chapter) %>%
  summarize(words = n())

tidy_books %>%
  semi_join(bingnegative) %>%
  group_by(book, chapter) %>%
  summarize(negativewords=n()) %>%
  left_join(wordcounts, by=c('book', 'chapter')) %>%
  mutate(ratio=negativewords/words) %>%
  filter(chapter!=0) %>%
  slice_max(ratio,n=1) %>%
  ungroup()
## # A tibble: 4 × 5
##   book                                              chapter negat…¹ words  ratio
##   <chr>                                               <int>   <int> <int>  <dbl>
## 1 On the Origin of Species By Means of Natural Sel…       3       5    86 0.0581
## 2 The Descent of Man, and Selection in Relation to…      20       4    87 0.0460
## 3 The Expression of the Emotions in Man and Animals      10     249  4220 0.0590
## 4 The Voyage of the Beagle                               10     375 11202 0.0335
## # … with abbreviated variable name ¹​negativewords

This shows us the most negative chapters of each book, and their ratios of negativity to all other words.

N-Grams

So far we’ve only looked at single words, but many more accurate analyses are based on the relationship between words, allowing for some amount of context in the analysis.

Let’s look at some methods of tidytext for calculating and visualizing word relationships. We’ll begin by loading books from our previous work with word analyses.

library(dplyr)
library(tidytext)
library(gutenbergr)
library(ggplot2)

darwin_books <- gutenberg_download(c(944, 1227, 1228, 2300), mirror = 'https://mirror2.sandyriver.net/pub/gutenberg')

colnames(darwin_books)[1] <- 'book'

darwin_books$book[darwin_books$book == 944] <- 'The Voyage of the Beagle'
darwin_books$book[darwin_books$book == 1227] <- 'The Expression of the Emotions in Man and Animals'
darwin_books$book[darwin_books$book == 1228] <- 'On the Origin of Species By Means of Natural Selection'
darwin_books$book[darwin_books$book == 2300] <- 'The Descent of Man, and Selection in Relation to Sex'

darwin_bigrams <- darwin_books %>%
  unnest_tokens(bigram, text, token='ngrams', n=2)

darwin_bigrams
## # A tibble: 724,531 × 2
##    book                     bigram        
##    <chr>                    <chr>         
##  1 The Voyage of the Beagle the voyage    
##  2 The Voyage of the Beagle voyage of     
##  3 The Voyage of the Beagle of the        
##  4 The Voyage of the Beagle the beagle    
##  5 The Voyage of the Beagle beagle by     
##  6 The Voyage of the Beagle charles darwin
##  7 The Voyage of the Beagle <NA>          
##  8 The Voyage of the Beagle <NA>          
##  9 The Voyage of the Beagle <NA>          
## 10 The Voyage of the Beagle <NA>          
## # … with 724,521 more rows

This data is still in tidy text format and is structured as one-token-per-row. Each token is now a bigram: an overlapping two-word combination.

Counting and filtering n-grams (n standing for any number):

darwin_bigrams %>%
  count(bigram, sort=TRUE)
## # A tibble: 238,516 × 2
##    bigram       n
##    <chr>    <int>
##  1 of the   11297
##  2 <NA>      8947
##  3 in the    5257
##  4 on the    4093
##  5 to the    2849
##  6 the same  2048
##  7 that the  1947
##  8 it is     1830
##  9 of a      1610
## 10 and the   1590
## # … with 238,506 more rows

Most of the common bigrams are stop-words. This can be a good time to use tidyr’s separate command which splits a column into two based on a delimiter. This will let us make a column for word one and word two.

library(tidyr)

bigrams_separated <- darwin_bigrams %>%
  separate(bigram, c('word1', 'word2'), sep=' ') #This function is effectively saying "look at bigram, separate the columns into word1 and word2, separating the actual data by the space between them"

bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) #Filter out all word1/word2 entries that are found in the stop_word list on the word column

bigrams_separated
## # A tibble: 724,531 × 3
##    book                     word1   word2 
##    <chr>                    <chr>   <chr> 
##  1 The Voyage of the Beagle the     voyage
##  2 The Voyage of the Beagle voyage  of    
##  3 The Voyage of the Beagle of      the   
##  4 The Voyage of the Beagle the     beagle
##  5 The Voyage of the Beagle beagle  by    
##  6 The Voyage of the Beagle charles darwin
##  7 The Voyage of the Beagle <NA>    <NA>  
##  8 The Voyage of the Beagle <NA>    <NA>  
##  9 The Voyage of the Beagle <NA>    <NA>  
## 10 The Voyage of the Beagle <NA>    <NA>  
## # … with 724,521 more rows
bigrams_filtered
## # A tibble: 94,896 × 3
##    book                     word1   word2  
##    <chr>                    <chr>   <chr>  
##  1 The Voyage of the Beagle charles darwin 
##  2 The Voyage of the Beagle <NA>    <NA>   
##  3 The Voyage of the Beagle <NA>    <NA>   
##  4 The Voyage of the Beagle <NA>    <NA>   
##  5 The Voyage of the Beagle <NA>    <NA>   
##  6 The Voyage of the Beagle <NA>    <NA>   
##  7 The Voyage of the Beagle online  edition
##  8 The Voyage of the Beagle <NA>    <NA>   
##  9 The Voyage of the Beagle degree  symbol 
## 10 The Voyage of the Beagle degs    italics
## # … with 94,886 more rows

New bigram counts:

bigram_counts <- bigrams_filtered %>%
  unite(bigram, word1, word2, sep=' ')

bigram_counts
## # A tibble: 94,896 × 2
##    book                     bigram        
##    <chr>                    <chr>         
##  1 The Voyage of the Beagle charles darwin
##  2 The Voyage of the Beagle NA NA         
##  3 The Voyage of the Beagle NA NA         
##  4 The Voyage of the Beagle NA NA         
##  5 The Voyage of the Beagle NA NA         
##  6 The Voyage of the Beagle NA NA         
##  7 The Voyage of the Beagle online edition
##  8 The Voyage of the Beagle NA NA         
##  9 The Voyage of the Beagle degree symbol 
## 10 The Voyage of the Beagle degs italics  
## # … with 94,886 more rows

We may also be interested in trigrams (three-word combos). Let’s set those up.

darwin_trigrams <- darwin_books %>%
  unnest_tokens(trigram, text, token='ngrams', n=3) %>%
  separate(trigram, c('word1','word2','word3'), sep=' ') %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word,
         !word3 %in% stop_words$word) %>%
  count(word1, word2, word3, sort=TRUE)

darwin_trigrams
## # A tibble: 19,971 × 4
##    word1         word2  word3           n
##    <chr>         <chr>  <chr>       <int>
##  1 <NA>          <NA>   <NA>         9884
##  2 tierra        del    fuego          92
##  3 secondary     sexual characters     91
##  4 captain       fitz   roy            45
##  5 closely       allied species        30
##  6 de            la     physionomie    30
##  7 domestication vol    ii             26
##  8 vol           ii     pp             22
##  9 vertebrates   vol    iii            21
## 10 proc          zoolog soc            18
## # … with 19,961 more rows

Let’s actually analyze some of our bigrams from earlier.

bigrams_filtered %>%
  filter(word2 == 'selection') %>%
  count(book, word1, sort=TRUE)
## # A tibble: 39 × 3
##    book                                                   word1           n
##    <chr>                                                  <chr>       <int>
##  1 The Descent of Man, and Selection in Relation to Sex   sexual        254
##  2 On the Origin of Species By Means of Natural Selection natural       250
##  3 The Descent of Man, and Selection in Relation to Sex   natural       156
##  4 On the Origin of Species By Means of Natural Selection sexual         18
##  5 On the Origin of Species By Means of Natural Selection continued       6
##  6 The Descent of Man, and Selection in Relation to Sex   unconscious     6
##  7 On the Origin of Species By Means of Natural Selection methodical      5
##  8 The Descent of Man, and Selection in Relation to Sex   continued       5
##  9 On the Origin of Species By Means of Natural Selection unconscious     4
## 10 The Expression of the Emotions in Man and Animals      natural         4
## # … with 29 more rows
#Here we're looking at what single word1 values are most commonly paired with the word 'selection' for each book

Let’s look at tf-idf across bigrams throughout Darwin’s works; recall that idf measures how rare a term is across documents (a higher value means the bigram is more specific to that book).

bigram_tf_idf <- bigram_counts %>%
  count(book, bigram) %>%
  bind_tf_idf(bigram, book, n) %>%
  arrange(desc(tf_idf))

bigram_tf_idf
## # A tibble: 60,595 × 6
##    book                                       bigram     n      tf   idf  tf_idf
##    <chr>                                      <chr>  <int>   <dbl> <dbl>   <dbl>
##  1 The Expression of the Emotions in Man and… nerve…    47 0.00350 1.39  0.00485
##  2 On the Origin of Species By Means of Natu… natur…   250 0.0160  0.288 0.00460
##  3 The Expression of the Emotions in Man and… la ph…    35 0.00260 1.39  0.00361
##  4 The Voyage of the Beagle                   bueno…    54 0.00245 1.39  0.00339
##  5 The Voyage of the Beagle                   capta…    53 0.00240 1.39  0.00333
##  6 On the Origin of Species By Means of Natu… glaci…    36 0.00230 1.39  0.00319
##  7 The Voyage of the Beagle                   fitz …    50 0.00227 1.39  0.00314
##  8 The Expression of the Emotions in Man and… muscl…    30 0.00223 1.39  0.00310
##  9 The Expression of the Emotions in Man and… orbic…    29 0.00216 1.39  0.00299
## 10 The Expression of the Emotions in Man and… dr du…    26 0.00194 1.39  0.00268
## # … with 60,585 more rows

Now let’s graph those relationships, separating by book:

bigram_tf_idf %>%
  arrange(desc(tf_idf)) %>%
  group_by(book) %>%
  slice_max(tf_idf, n=10) %>%
  ungroup() %>%
  mutate(bigram=reorder(bigram, tf_idf)) %>%
  ggplot(aes(tf_idf, bigram, fill=book)) +
  geom_col(show.legend=FALSE) +
  facet_wrap(~book, ncol=2, scales='free') +
  labs(x='tf-idf of bigrams', y=NULL)

Using bigrams to provide context in sentiment analysis

bigrams_separated %>%
  filter(word1 == 'not') %>%
  count(word1, word2, sort=TRUE)
## # A tibble: 867 × 3
##    word1 word2     n
##    <chr> <chr> <int>
##  1 not   be      128
##  2 not   have    104
##  3 not   only    103
##  4 not   a       100
##  5 not   to       98
##  6 not   been     89
##  7 not   the      82
##  8 not   at       70
##  9 not   know     60
## 10 not   so       58
## # … with 857 more rows

By doing sentiment analysis on bigrams, we can examine how often sentiment-associated words are preceded by a modifier like ‘not’ or other negating words. Here’s the sentiment values of words in the afinn lexicon:

afinn <- get_sentiments('afinn')

afinn
## # A tibble: 2,477 × 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # … with 2,467 more rows

We can examine the most frequent words that were preceded by “not”, and associate with sentiment.

not_words <- bigrams_separated %>%
  filter(word1 == 'not') %>%
  inner_join(afinn, by=c(word2='word')) %>%
  count(word2, value, sort=TRUE)

not_words
## # A tibble: 114 × 3
##    word2     value     n
##    <chr>     <dbl> <int>
##  1 doubt        -1    25
##  2 like          2    11
##  3 pretend      -1     9
##  4 wish          1     8
##  5 admit        -1     7
##  6 difficult    -1     5
##  7 easy          1     5
##  8 reach         1     5
##  9 extend        1     4
## 10 forget       -1     4
## # … with 104 more rows

Now we can see that “not like” has a positive value, but should likely be counted as negative in-context. The 11 occurrences of this may skew our analysis.

Let’s visualize:

not_words %>%
  mutate(contribution=n*value) %>%
  arrange(desc(abs(contribution))) %>%
  head(20) %>%
  mutate(word2=reorder(word2, contribution)) %>%
  ggplot(aes(n*value, word2, fill=n*value > 0)) +
  geom_col(show.legend=FALSE) +
  labs(x='Sentiment Value * Number of Occurrences', y='Words Preceded by \"Not\"')

So as we can see, ‘like’ scores very high in its sentiment value effect despite being preceded by ‘not’. This gives us a good impression of what positive words are likely skewing our data in the positive direction due to a lack of bigram context.

negation_words <- c('not', 'no', 'never', 'non', 'without')

negated_words <- bigrams_separated %>% #This whole thing will pull out all word combos using our negation words
  filter(word1 %in% negation_words) %>%
  inner_join(afinn, by=c(word2='word')) %>%
  count(word1, word2, value, sort=TRUE)

negated_words
## # A tibble: 208 × 4
##    word1   word2     value     n
##    <chr>   <chr>     <dbl> <int>
##  1 no      doubt        -1   210
##  2 not     doubt        -1    25
##  3 no      great         3    19
##  4 not     like          2    11
##  5 not     pretend      -1     9
##  6 not     wish          1     8
##  7 without doubt        -1     8
##  8 not     admit        -1     7
##  9 no      greater       3     6
## 10 not     difficult    -1     5
## # … with 198 more rows

Let’s visualize our negation words

negated_words %>%
  mutate(contribution = n*value, word2=reorder(paste(word2, word1, sep='_'), contribution)) %>%
  group_by(word1) %>%
  slice_max(abs(contribution), n=12, with_ties=FALSE) %>%
  ggplot(aes(word2, contribution, fill=n*value>0)) +
  geom_col(show.legend=FALSE) +
  facet_wrap(~word1, scales='free') +
  scale_x_discrete(labels=function(x) gsub('_.+$', '', x)) +
  xlab('Words Preceded by Negation Term') +
  ylab('Sentiment Value * Number of Occurrences') +
  coord_flip()

Let’s visualize a network of bigrams using ggraph:

library(igraph)

bigram_counts2 <- bigrams_filtered %>%
  count(word1, word2, sort=TRUE)

bigram_graph <- bigram_counts2 %>%
  filter(n>20) %>%
  graph_from_data_frame()

bigram_graph
## IGRAPH 2b7850a DN-- 203 140 -- 
## + attr: name (v/c), n (e/n)
## + edges from 2b7850a (vertex names):
##  [1] NA        ->NA          natural   ->selection   sexual    ->selection  
##  [4] vol       ->ii          lower     ->animals     sexual    ->differences
##  [7] south     ->america     distinct  ->species     secondary ->sexual     
## [10] breeding  ->season      closely   ->allied      sexual    ->characters 
## [13] tierra    ->del         del       ->fuego       vol       ->iii        
## [16] de        ->la          natural   ->history     fresh     ->water      
## [19] north     ->america     bright    ->colours     sexual    ->difference 
## [22] allied    ->species     tail      ->feathers    strongly  ->marked     
## + ... omitted several edges
library(ggraph)
set.seed(1234)

ggraph(bigram_graph, layout='fr') +
  geom_edge_link() +
  geom_node_point() +
  geom_node_text(aes(label=name), vjust=1, hjust=1)

We can also add directionality to this network

set.seed(1234)

a <- grid::arrow(type='closed', length=unit(0.15, 'inches'))

ggraph(bigram_graph, layout='fr') +
  geom_edge_link(aes(edge_alpha=n), show.legend=FALSE, arrow=a, end_cap=circle(0.07, 'inches')) +
  geom_node_point(color='lightblue', size=3) +
  geom_node_text(aes(label=name), vjust=1, hjust=1) +
  theme_void()

As you can see, the new graph gives us arrows that show the directionality of these word relationships, with darker arrows making for stronger connections.

Word Frequencies

A central question in text mining is how to quantify what a document is about. We can do this by looking at words that make up the document and measuring term frequency.

There are a lot of words that may not be important; these are the stop words we’re already familiar with. One way to remedy this is to weight words by their inverse document frequency (idf), which decreases the weight of commonly used words and increases the weight of words that appear in fewer documents.

Let’s look at term frequency from Darwin’s works

library(dplyr)
library(tidytext)
library(gutenbergr)

book_words <- gutenberg_download(c(944, 1227, 1228, 2300), mirror = 'https://mirror2.sandyriver.net/pub/gutenberg')

colnames(book_words)[1] <- 'book'

book_words$book[book_words$book == 944] <- "The Voyage of the Beagle"
book_words$book[book_words$book == 1227] <- "The Expression of the Emotions in Man and Animals"
book_words$book[book_words$book == 1228] <- "On the Origin of Species By Means of Natural Selection"
book_words$book[book_words$book == 2300] <- "The Descent of Man, and Selection in Relation to Sex"

Now let’s dissect our text.

book_words <- book_words %>%
  unnest_tokens(word, text) %>%
  count(book, word, sort=TRUE)

book_words
## # A tibble: 43,024 × 3
##    book                                                   word      n
##    <chr>                                                  <chr> <int>
##  1 The Descent of Man, and Selection in Relation to Sex   the   25490
##  2 The Voyage of the Beagle                               the   16930
##  3 The Descent of Man, and Selection in Relation to Sex   of    16762
##  4 On the Origin of Species By Means of Natural Selection the   10301
##  5 The Voyage of the Beagle                               of     9438
##  6 The Descent of Man, and Selection in Relation to Sex   in     8882
##  7 The Expression of the Emotions in Man and Animals      the    8045
##  8 On the Origin of Species By Means of Natural Selection of     7864
##  9 The Descent of Man, and Selection in Relation to Sex   and    7854
## 10 The Descent of Man, and Selection in Relation to Sex   to     5901
## # … with 43,014 more rows

As you can see, the raw counts on their own tell us little; the most frequent words are all common stop words. Let’s also get the total word count for every book.

book_words$n <- as.numeric(book_words$n)

total_words <- book_words %>%
  group_by(book) %>%
  summarize(total = sum(n))

book_words
## # A tibble: 43,024 × 3
##    book                                                   word      n
##    <chr>                                                  <chr> <dbl>
##  1 The Descent of Man, and Selection in Relation to Sex   the   25490
##  2 The Voyage of the Beagle                               the   16930
##  3 The Descent of Man, and Selection in Relation to Sex   of    16762
##  4 On the Origin of Species By Means of Natural Selection the   10301
##  5 The Voyage of the Beagle                               of     9438
##  6 The Descent of Man, and Selection in Relation to Sex   in     8882
##  7 The Expression of the Emotions in Man and Animals      the    8045
##  8 On the Origin of Species By Means of Natural Selection of     7864
##  9 The Descent of Man, and Selection in Relation to Sex   and    7854
## 10 The Descent of Man, and Selection in Relation to Sex   to     5901
## # … with 43,014 more rows

Now let’s join our book_words and totals.

book_words <- left_join(book_words, total_words)

book_words
## # A tibble: 43,024 × 4
##    book                                                   word      n  total
##    <chr>                                                  <chr> <dbl>  <dbl>
##  1 The Descent of Man, and Selection in Relation to Sex   the   25490 311041
##  2 The Voyage of the Beagle                               the   16930 208118
##  3 The Descent of Man, and Selection in Relation to Sex   of    16762 311041
##  4 On the Origin of Species By Means of Natural Selection the   10301 157002
##  5 The Voyage of the Beagle                               of     9438 208118
##  6 The Descent of Man, and Selection in Relation to Sex   in     8882 311041
##  7 The Expression of the Emotions in Man and Animals      the    8045 110414
##  8 On the Origin of Species By Means of Natural Selection of     7864 157002
##  9 The Descent of Man, and Selection in Relation to Sex   and    7854 311041
## 10 The Descent of Man, and Selection in Relation to Sex   to     5901 311041
## # … with 43,014 more rows

As you can see, the usual suspects are the most common words and still don’t tell us anything.

Let’s try to fix that.

library(ggplot2)

ggplot(book_words, aes(n/total, fill=book)) +
  geom_histogram(show.legend=FALSE) +
  xlim(NA, 0.0009) +
  facet_wrap(~book, ncol=2, scales='free_y')

Zipf’s Law - the frequency with which a word appears is inversely proportional to its rank. Let’s create a dataset that ranks each word within each book by frequency so we can examine Zipf’s law.

freq_by_rank <- book_words %>%
  group_by(book) %>%
  mutate(rank=row_number(),
         'term frequency'=n/total) %>%
  ungroup()

freq_by_rank
## # A tibble: 43,024 × 6
##    book                                         word      n  total  rank term …¹
##    <chr>                                        <chr> <dbl>  <dbl> <int>   <dbl>
##  1 The Descent of Man, and Selection in Relati… the   25490 311041     1  0.0820
##  2 The Voyage of the Beagle                     the   16930 208118     1  0.0813
##  3 The Descent of Man, and Selection in Relati… of    16762 311041     2  0.0539
##  4 On the Origin of Species By Means of Natura… the   10301 157002     1  0.0656
##  5 The Voyage of the Beagle                     of     9438 208118     2  0.0453
##  6 The Descent of Man, and Selection in Relati… in     8882 311041     3  0.0286
##  7 The Expression of the Emotions in Man and A… the    8045 110414     1  0.0729
##  8 On the Origin of Species By Means of Natura… of     7864 157002     2  0.0501
##  9 The Descent of Man, and Selection in Relati… and    7854 311041     4  0.0253
## 10 The Descent of Man, and Selection in Relati… to     5901 311041     5  0.0190
## # … with 43,014 more rows, and abbreviated variable name ¹​`term frequency`

Now let’s plot it properly:

freq_by_rank %>%
  ggplot(aes(rank, `term frequency`, color=book)) +
  geom_line(size=1.1, alpha=0.8, show.legend=FALSE) +
  scale_x_log10() +
  scale_y_log10()
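To put a rough number on Zipf’s law, we could fit a line on the log-log scale over a middle range of ranks (a sketch, not part of the original analysis); a slope near -1 is the classic Zipf result.

rank_subset <- freq_by_rank %>%
  filter(rank < 500, rank > 10) #ignore the extreme head and tail of the rank distribution

lm(log10(`term frequency`) ~ log10(rank), data = rank_subset) #the slope estimates the exponent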

Let’s use tf-idf to find characteristic words for each document by decreasing the weight of commonly used words and increasing the weight of words used less often.

book_tf_idf <- book_words %>%
  bind_tf_idf(word, book, n)

book_tf_idf
## # A tibble: 43,024 × 7
##    book                                   word      n  total     tf   idf tf_idf
##    <chr>                                  <chr> <dbl>  <dbl>  <dbl> <dbl>  <dbl>
##  1 The Descent of Man, and Selection in … the   25490 311041 0.0820     0      0
##  2 The Voyage of the Beagle               the   16930 208118 0.0813     0      0
##  3 The Descent of Man, and Selection in … of    16762 311041 0.0539     0      0
##  4 On the Origin of Species By Means of … the   10301 157002 0.0656     0      0
##  5 The Voyage of the Beagle               of     9438 208118 0.0453     0      0
##  6 The Descent of Man, and Selection in … in     8882 311041 0.0286     0      0
##  7 The Expression of the Emotions in Man… the    8045 110414 0.0729     0      0
##  8 On the Origin of Species By Means of … of     7864 157002 0.0501     0      0
##  9 The Descent of Man, and Selection in … and    7854 311041 0.0253     0      0
## 10 The Descent of Man, and Selection in … to     5901 311041 0.0190     0      0
## # … with 43,014 more rows
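For intuition, here is a minimal sketch (not part of the original analysis) of what bind_tf_idf() computes, using the standard definitions tf = count / total words in the document and idf = ln(number of documents / number of documents containing the term):

n_books <- n_distinct(book_words$book)

manual_tf_idf <- book_words %>%
  group_by(word) %>%
  mutate(idf = log(n_books / n_distinct(book))) %>% #n_distinct(book) = how many of the books contain this word
  ungroup() %>%
  mutate(tf = n / total,
         tf_idf = tf * idf)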

Now let’s look specifically at terms with a high tf-idf.

book_tf_idf %>%
  select(-total) %>%
  arrange(desc(tf_idf))
## # A tibble: 43,024 × 6
##    book                                        word      n      tf   idf  tf_idf
##    <chr>                                       <chr> <dbl>   <dbl> <dbl>   <dbl>
##  1 The Expression of the Emotions in Man and … tears   126 1.14e-3 1.39  1.58e-3
##  2 The Expression of the Emotions in Man and … blush   114 1.03e-3 1.39  1.43e-3
##  3 The Expression of the Emotions in Man and … eyeb…   149 1.35e-3 0.693 9.35e-4
##  4 The Voyage of the Beagle                    degs    117 5.62e-4 1.39  7.79e-4
##  5 On the Origin of Species By Means of Natur… sele…   412 2.62e-3 0.288 7.55e-4
##  6 The Descent of Man, and Selection in Relat… sexu…   745 2.40e-3 0.288 6.89e-4
##  7 The Descent of Man, and Selection in Relat… shewn   143 4.60e-4 1.39  6.37e-4
##  8 On the Origin of Species By Means of Natur… hybr…   133 8.47e-4 0.693 5.87e-4
##  9 The Expression of the Emotions in Man and … frown    46 4.17e-4 1.39  5.78e-4
## 10 The Descent of Man, and Selection in Relat… sele…   621 2.00e-3 0.288 5.74e-4
## # … with 43,014 more rows

Let’s look at a visualization for these high tf-idf words.

library(forcats)

book_tf_idf %>%
  group_by(book) %>%
  slice_max(tf_idf, n=15) %>%
  ungroup() %>%
  ggplot(aes(tf_idf, fct_reorder(word, tf_idf), fill=book)) +
  geom_col(show.legend=FALSE) +
  facet_wrap(~book, ncol=2, scales='free') +
  labs(x='tf-idf',y=NULL)

Text Analysis with APIs

Intro

Text mining as outlined in the previous section has many common uses in today’s social media landscape. Companies, political parties, and others use text mining alongside APIs to measure public opinion on current issues, on their brand or identity, and more. Knowing how to use these APIs and analyze the data pulled from them is a marketable skill in more ways than one.

Downloading from the Twitter API

Downloading via the Twitter API requires a sign-up process that grants you a bearer token, which you then use to pull tweets from specific date ranges. Below is a full example of how you might pull data from Twitter, but all of the code is commented out since the token will likely have expired by the time this document is read. For the next section, pre-downloaded files will be used instead.

#library(academictwitteR)     #This package allows R to interface easily with the Twitter API

#bearer_token <- "AAAAAAAAAAAAAAAAAAAAAByFhgEAAAAAl4Ql0CyArwygcl0FewUO5ARSaAQ%3DvJpyDIOBKBVdtgnEv7wrpdwoeI$wbQLbxfSYElUGZuwix0kbQm"

#tweets20US <-
# get_all_tweets(
#   'covid-19 has:geo',       #This part requires any pulled tweets to include geolocation data
#   '2020-06-01T01:00:00Z',   #This part specifies when to begin pulling tweets
#   '2020-06-08T01:00:00Z',   #This part specifies when to stop pulling tweets
#   bearer_token,             #This is the token that we stored from above
#   n=5000,                   #This is the number of tweets to pull
#   country='US'              #And here is the country we want to pull from
# )

#tweets21US <-
# get_all_tweets(
#   'covid-19 has:geo',
#   '2021-01-01T01:00:00Z',
#   '2021-01-08T01:00:00Z',
#   bearer_token,
#   n=5000,
#   country='US'
# )

#tweets22US <-
# get_all_tweets(
#   'covid-19 has:geo',
#   '2022-01-01T01:00:00Z',
#   '2022-01-08T01:00:00Z',
#   bearer_token,
#   n=5000,
#   country='US'
# )
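
If the calls above succeed, the pulled tweets could be written to disk so the rest of the analysis can run without querying the API again. The lines below are a sketch of that step (kept commented out for the same reason as the code above); the file names simply mirror the ones loaded in the next section.

#write.csv(tweets20US, 'tweets20us.csv')
#write.csv(tweets21US, 'tweets21us.csv')
#write.csv(tweets22US, 'tweets22us.csv')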

In the next section, we’ll look at data from the above year ranges, as well as data from Great Britain and Canada.

Twitter Sentiment Analysis

First things first, we need to load our data & required libraries:

library(lubridate)
library(ggplot2)
library(dplyr)
library(readr)
library(tidytext)
library(stringr)
library(tidyr)
library(textdata)
library(reshape2)
library(scales)

tweets20CA <- read.csv('~/School/Fall 2022/Bioinf/R Files/Twitter API Text Mining/tweets20CA.csv', row.names=1)
tweets21CA <- read.csv('~/School/Fall 2022/Bioinf/R Files/Twitter API Text Mining/tweets21CA.csv', row.names=1)
tweets22CA <- read.csv('~/School/Fall 2022/Bioinf/R Files/Twitter API Text Mining/tweets22CA.csv', row.names=1)

tweets20GB <- read.csv('~/School/Fall 2022/Bioinf/R Files/Twitter API Text Mining/tweets20GB.csv', row.names=1)
tweets21GB <- read.csv('~/School/Fall 2022/Bioinf/R Files/Twitter API Text Mining/tweets21GB.csv', row.names=1)
tweets22GB <- read.csv('~/School/Fall 2022/Bioinf/R Files/Twitter API Text Mining/tweets22GB.csv', row.names=1)

tweets20US <- read.csv('~/School/Fall 2022/Bioinf/R Files/Twitter API Text Mining/tweets20us.csv', row.names=1)
tweets21US <- read.csv('~/School/Fall 2022/Bioinf/R Files/Twitter API Text Mining/tweets21us.csv', row.names=1)
tweets22US <- read.csv('~/School/Fall 2022/Bioinf/R Files/Twitter API Text Mining/tweets22us.csv', row.names=1)

Next we’ll combine the data for each country, tagging each tweet with its year and parsing the creation time into a proper timestamp:

tweetsUS <- bind_rows(tweets21US %>%
                        mutate(year='y2021'),
                      tweets22US %>%
                        mutate(year='y2022'),
                      tweets20US %>%
                        mutate(year='y2020')) %>%
  mutate(timestamp=ymd_hms(created_at))

tweetsGB <- bind_rows(tweets21GB %>%
                        mutate(year='y2021'),
                      tweets22GB %>%
                        mutate(year='y2022'),
                      tweets20GB %>%
                        mutate(year='y2020')) %>%
  mutate(timestamp=ymd_hms(created_at))

tweetsCA <- bind_rows(tweets21CA %>%
                        mutate(year='y2021'),
                      tweets22CA %>%
                        mutate(year='y2022'),
                      tweets20CA %>%
                        mutate(year='y2020')) %>%
  mutate(timestamp=ymd_hms(created_at))

And now we’ll clean our data. We’ll drop retweets, strip leftover HTML entities (&amp;, &lt;, &gt;), and remove stop words so that only the sentiment-relevant words remain.

remove_reg <- '&amp;|&lt;|&gt;'

tidy_tweetsUS <- tweetsUS %>%
  filter(!str_detect(text, '^RT')) %>%
  mutate(text=str_remove_all(text, remove_reg)) %>%
  unnest_tokens(word, text, token='tweets') %>%
  filter(!word %in% stop_words$word,
         !word %in% str_remove_all(stop_words$word, "'"),
         str_detect(word, '[a-z]'))

tidy_tweetsGB <- tweetsGB %>%
  filter(!str_detect(text, '^RT')) %>%
  mutate(text=str_remove_all(text, remove_reg)) %>%
  unnest_tokens(word, text, token='tweets') %>%
  filter(!word %in% stop_words$word,
         !word %in% str_remove_all(stop_words$word, "'"),
         str_detect(word, '[a-z]'))

tidy_tweetsCA <- tweetsCA %>%
  filter(!str_detect(text, '^RT')) %>%
  mutate(text=str_remove_all(text, remove_reg)) %>%
  unnest_tokens(word, text, token='tweets') %>%
  filter(!word %in% stop_words$word,
         !word %in% str_remove_all(stop_words$word, "'"),
         str_detect(word, '[a-z]'))
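
Since the same cleaning steps are applied to each country, they could also be wrapped in a small helper function. This is just a sketch of that refactor; the per-country pipelines above are what the rest of the document uses.

clean_tweets <- function(tweets) {
  tweets %>%
    filter(!str_detect(text, '^RT')) %>%                 #drop retweets
    mutate(text=str_remove_all(text, remove_reg)) %>%    #strip HTML entities
    unnest_tokens(word, text, token='tweets') %>%        #tokenize while preserving hashtags and handles
    filter(!word %in% stop_words$word,
           !word %in% str_remove_all(stop_words$word, "'"),
           str_detect(word, '[a-z]'))                    #keep only words containing letters
}

#e.g. tidy_tweetsUS <- clean_tweets(tweetsUS)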

Now let’s visualize tweet volume over time with a faceted histogram for each country.

ggplot(tweetsUS, aes(x=timestamp, fill=year)) +
  geom_histogram(position='identity', bins=20, show.legend=FALSE) +
  facet_wrap(~year, ncol=1)

ggplot(tweetsGB, aes(x=timestamp, fill=year)) +
  geom_histogram(position='identity', bins=20, show.legend=FALSE) +
  facet_wrap(~year, ncol=1)

ggplot(tweetsCA, aes(x=timestamp, fill=year)) +
  geom_histogram(position='identity', bins=20, show.legend=FALSE) +
  facet_wrap(~year, ncol=1)

Let’s view the word frequencies for our countries.

frequencyUS <- tidy_tweetsUS %>%
  count(year, word, sort=TRUE) %>%
  left_join(tidy_tweetsUS %>%
              count(year, name='total')) %>%
  mutate(freq=n/total)
## Joining, by = "year"
frequencyGB <- tidy_tweetsGB %>%
  count(year, word, sort=TRUE) %>%
  left_join(tidy_tweetsGB %>%
              count(year, name='total')) %>%
  mutate(freq=n/total)
## Joining, by = "year"
frequencyCA <- tidy_tweetsCA %>%
  count(year, word, sort=TRUE) %>%
  left_join(tidy_tweetsCA %>%
              count(year, name='total')) %>%
  mutate(freq=n/total)
## Joining, by = "year"
head(frequencyUS)
##    year     word    n total        freq
## 1 y2021    covid 4431 66977 0.066157039
## 2 y2020    covid 4339 67422 0.064355848
## 3 y2022    covid 3157 57060 0.055327725
## 4 y2022 #covid19  894 57060 0.015667718
## 5 y2021  vaccine  662 66977 0.009883990
## 6 y2021   people  458 66977 0.006838168
head(frequencyGB)
##    year     word    n total        freq
## 1 y2020    covid 3757 71255 0.052726124
## 2 y2021    covid 3274 53897 0.060745496
## 3 y2020 #covid19  889 71255 0.012476317
## 4 y2022    covid  782 17402 0.044937364
## 5 y2020   people  556 71255 0.007802961
## 6 y2021 #covid19  442 53897 0.008200828
head(frequencyCA)
##    year     word    n total       freq
## 1 y2020    covid 2419 48348 0.05003309
## 2 y2021    covid 1702 31882 0.05338435
## 3 y2022    covid  788 17302 0.04554387
## 4 y2020   stigma  559 48348 0.01156201
## 5 y2020   health  438 48348 0.00905932
## 6 y2020 #covid19  435 48348 0.00899727

Let’s now get the frequency by year for each word.

frequencyUS2 <- frequencyUS %>%
  select(year, word, freq) %>%
  pivot_wider(names_from=year, values_from=freq) %>%
  arrange('y2020', 'y2021', 'y2022')

frequencyGB2 <- frequencyGB %>%
  select(year, word, freq) %>%
  pivot_wider(names_from=year, values_from=freq) %>%
  arrange('y2020', 'y2021', 'y2022')

frequencyCA2 <- frequencyCA %>%
  select(year, word, freq) %>%
  pivot_wider(names_from=year, values_from=freq) %>%
  arrange('y2020', 'y2021', 'y2022')

head(frequencyUS2)
## # A tibble: 6 × 4
##   word       y2021    y2020   y2022
##   <chr>      <dbl>    <dbl>   <dbl>
## 1 covid    0.0662  0.0644   0.0553 
## 2 #covid19 0.00527 0.00577  0.0157 
## 3 vaccine  0.00988 0.000519 0.00435
## 4 people   0.00684 0.00662  0.00605
## 5 testing  0.00148 0.00191  0.00619
## 6 de       0.00337 0.00393  0.00547
head(frequencyGB2)
## # A tibble: 6 × 4
##   word            y2020   y2021   y2022
##   <chr>           <dbl>   <dbl>   <dbl>
## 1 covid       0.0527    0.0607  0.0449 
## 2 #covid19    0.0125    0.00820 0.0220 
## 3 people      0.00780   0.00759 0.00684
## 4 vaccine     0.000463  0.00727 0.00471
## 5 uk          0.00547   0.00638 0.00356
## 6 vaccination 0.0000140 0.00505 0.00218
head(frequencyCA2)
## # A tibble: 6 × 4
##   word       y2020   y2021    y2022
##   <chr>      <dbl>   <dbl>    <dbl>
## 1 covid    0.0500  0.0534  0.0455  
## 2 stigma   0.0116  0.00982 0.000462
## 3 health   0.00906 0.0127  0.00520 
## 4 #covid19 0.00900 0.00489 0.0145  
## 5 fighting 0.00819 0.00910 0.000520
## 6 de       0.00780 0.00794 0.00647

Let’s visualize these frequencies by word, comparing 2020 against 2022 for the US. Note that you could make this more readable by reducing the number of words being analyzed; a sketch of one such filter follows the plot.

ggplot(frequencyUS2, aes(y2020, y2022)) +
  geom_jitter(alpha=0.05, size=2.5, width=0.25, height=0.25) +
  geom_text(aes(label=word), check_overlap=TRUE, vjust=1.5) +
  scale_x_log10(labels=percent_format()) +
  scale_y_log10(labels=percent_format()) +
  geom_abline(color='forestgreen')
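
One way to reduce the clutter (an illustrative tweak, not part of the original analysis) is to keep only words above a minimum frequency in both years before plotting. The 0.0005 cutoff below is an arbitrary choice.

frequencyUS2 %>%
  filter(y2020 > 0.0005, y2022 > 0.0005) %>%   #drop rare words (and words missing in either year)
  ggplot(aes(y2020, y2022)) +
  geom_jitter(alpha=0.1, size=2.5, width=0.1, height=0.1) +
  geom_text(aes(label=word), check_overlap=TRUE, vjust=1.5) +
  scale_x_log10(labels=percent_format()) +
  scale_y_log10(labels=percent_format()) +
  geom_abline(color='forestgreen')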

Let’s now try filtering our US data by nrc’s “joy” coded words.

nrc_joy <- get_sentiments('nrc') %>%
  filter(sentiment == 'joy')

tidy_tweetsUS %>%
  filter(year=='y2020') %>%
  inner_join(nrc_joy) %>%
  count(word, sort=TRUE)
## Joining, by = "word"
##              word  n
## 1            safe 93
## 2           money 64
## 3            hope 56
## 4            food 50
## 5             god 45
## 6            love 43
## 7            vote 41
## 8           share 36
## 9           happy 32
## 10           true 32
## 11       football 27
## 12          found 27
## 13       peaceful 27
## 14            pay 26
## 15      resources 26
## 16         church 25
## 17         friend 25
## 18         rising 23
## 19          proud 21
## 20         mother 20
## 21        excited 19
## 22        feeling 19
## 23        respect 19
## 24          beach 18
## 25     graduation 18
## 26          treat 18
## 27      beautiful 17
## 28        finally 17
## 29           pray 17
## 30         pretty 17
## 31           deal 16
## 32            fun 16
## 33           hero 16
## 34        special 16
## 35       birthday 15
## 36       favorite 15
## 37      providing 14
## 38          alive 13
## 39         create 13
## 40         income 13
## 41           star 13
## 42        success 13
## 43           baby 12
## 44          enjoy 12
## 45          peace 12
## 46          youth 12
## 47          bless 11
## 48        closure 11
## 49      excellent 11
## 50           glad 11
## 51           luck 11
## 52           save 11
## 53            sun 11
## 54      wonderful 11
## 55           beer 10
## 56          pride 10
## 57           cash  9
## 58    celebration  9
## 59       ceremony  9
## 60          child  9
## 61           jump  9
## 62        perfect  9
## 63         praise  9
## 64        advance  8
## 65          clean  8
## 66      encourage  8
## 67       thankful  8
## 68          award  7
## 69          faith  7
## 70        freedom  7
## 71          grant  7
## 72       humanity  7
## 73       daughter  6
## 74       friendly  6
## 75         garden  6
## 76        healing  6
## 77            joy  6
## 78          labor  6
## 79       progress  6
## 80            art  5
## 81     basketball  5
## 82    celebrating  5
## 83        content  5
## 84          dance  5
## 85       equality  5
## 86       generous  5
## 87          green  5
## 88        helpful  5
## 89        improve  5
## 90       laughing  5
## 91         loving  5
## 92        miracle  5
## 93          music  5
## 94          saint  5
## 95      supremacy  5
## 96       thriving  5
## 97       vacation  5
## 98         wealth  5
## 99       advocacy  4
## 100         birth  4
## 101       blessed  4
## 102    celebrated  4
## 103       charity  4
## 104     delicious  4
## 105          gain  4
## 106          heal  4
## 107     heartfelt  4
## 108         laugh  4
## 109      majority  4
## 110  organization  4
## 111       pleased  4
## 112      powerful  4
## 113       promise  4
## 114    successful  4
## 115      surprise  4
## 116       winning  4
## 117  appreciation  3
## 118        beauty  3
## 119      blessing  3
## 120         bonus  3
## 121     cathedral  3
## 122     childhood  3
## 123       comfort  3
## 124       endless  3
## 125      festival  3
## 126        freely  3
## 127          grow  3
## 128  inauguration  3
## 129        parade  3
## 130    retirement  3
## 131         score  3
## 132       smiling  3
## 133           spa  3
## 134         sweet  3
## 135          swim  3
## 136    unexpected  3
## 137     visionary  3
## 138       visitor  3
## 139     volunteer  3
## 140        weight  3
## 141  accomplished  2
## 142       approve  2
## 143        candid  2
## 144      cheering  2
## 145     closeness  2
## 146         cream  2
## 147         elite  2
## 148       engaged  2
## 149      enjoying  2
## 150 entertainment  2
## 151      exciting  2
## 152       fitting  2
## 153         glory  2
## 154      gorgeous  2
## 155 grandchildren  2
## 156       holiday  2
## 157       hopeful  2
## 158           hug  2
## 159   improvement  2
## 160      inspired  2
## 161    intimately  2
## 162        invite  2
## 163       journey  2
## 164        joyous  2
## 165         kudos  2
## 166        liquor  2
## 167         lucky  2
## 168     marvelous  2
## 169    passionate  2
## 170     practiced  2
## 171        ribbon  2
## 172        salute  2
## 173           sex  2
## 174      shopping  2
## 175       spirits  2
## 176       succeed  2
## 177      treasure  2
## 178       victory  2
## 179         wages  2
## 180      welcomed  2
## 181        winner  2
## 182     abundance  1
## 183    accomplish  1
## 184     adoration  1
## 185          amen  1
## 186     amusement  1
## 187       banquet  1
## 188     blessings  1
## 189         bloom  1
## 190        bridal  1
## 191     brilliant  1
## 192         buddy  1
## 193         chant  1
## 194    charitable  1
## 195     chocolate  1
## 196         clown  1
## 197     communion  1
## 198    compliment  1
## 199    confidence  1
## 200     confident  1
## 201          dawn  1
## 202       embrace  1
## 203     enlighten  1
## 204  entertaining  1
## 205        erotic  1
## 206        exceed  1
## 207         fancy  1
## 208          gift  1
## 209        giggle  1
## 210       glimmer  1
## 211     gratitude  1
## 212     happiness  1
## 213     healthful  1
## 214        heroic  1
## 215        heyday  1
## 216     hilarious  1
## 217          hire  1
## 218        honest  1
## 219  humanitarian  1
## 220  independence  1
## 221   inspiration  1
## 222       inspire  1
## 223  intelligence  1
## 224          kiss  1
## 225      liberate  1
## 226       liberty  1
## 227         loyal  1
## 228        luxury  1
## 229      majestic  1
## 230      marriage  1
## 231         marry  1
## 232       massage  1
## 233         medal  1
## 234      ministry  1
## 235       musical  1
## 236       notable  1
## 237         opera  1
## 238     orchestra  1
## 239   outstanding  1
## 240       passion  1
## 241        pastor  1
## 242      pleasant  1
## 243       praised  1
## 244      precious  1
## 245    privileged  1
## 246     readiness  1
## 247     receiving  1
## 248    recreation  1
## 249       rejoice  1
## 250         repay  1
## 251  reproductive  1
## 252        rescue  1
## 253         revel  1
## 254        reward  1
## 255       romance  1
## 256        salary  1
## 257     satisfied  1
## 258   scholarship  1
## 259       shining  1
## 260         silly  1
## 261         smile  1
## 262        spouse  1
## 263         sunny  1
## 264      sunshine  1
## 265      superman  1
## 266     supporter  1
## 267        sweets  1
## 268      symphony  1
## 269         teach  1
## 270        tender  1
## 271   tranquility  1
## 272        trophy  1
## 273       undying  1
## 274         vivid  1
## 275   wonderfully  1
## 276       worship  1
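
The top “joy” words could also be shown as a bar chart, mirroring the earlier frequency plots. This is an optional sketch rather than part of the original walkthrough.

tidy_tweetsUS %>%
  filter(year=='y2020') %>%
  inner_join(nrc_joy, by='word') %>%
  count(word, sort=TRUE) %>%
  slice_max(n, n=15) %>%
  ggplot(aes(n, reorder(word, n))) +
  geom_col(fill='darkblue') +
  labs(x='count', y=NULL)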

We can do the same with “anger” coded words. This time we’ll use our data from Britain.

nrc_anger <- get_sentiments('nrc') %>%
  filter(sentiment=='anger')

tidy_tweetsGB %>%
  filter(year=='y2022') %>%
  inner_join(nrc_anger) %>%
  count(word, sort=TRUE)
## Joining, by = "word"
##               word  n
## 1            death 26
## 2              bad 18
## 3              jab 15
## 4          disease 12
## 5          feeling  9
## 6              hit  8
## 7           cancer  7
## 8          deserve  7
## 9            dying  7
## 10         killing  7
## 11           money  7
## 12          bloody  6
## 13           fight  6
## 14         poverty  6
## 15           smell  6
## 16          deadly  5
## 17            fear  5
## 18          losing  5
## 19        politics  5
## 20            shit  5
## 21            shot  5
## 22           treat  5
## 23           words  5
## 24          attack  4
## 25           awful  4
## 26           blame  4
## 27           court  4
## 28      disruption  4
## 29        fighting  4
## 30           lying  4
## 31             mad  4
## 32       mortality  4
## 33         opposed  4
## 34          resign  4
## 35          threat  4
## 36         adverse  3
## 37          broken  3
## 38           clash  3
## 39          damage  3
## 40            hate  3
## 41   insignificant  3
## 42            loss  3
## 43      opposition  3
## 44           shock  3
## 45         anxiety  2
## 46        argument  2
## 47         assault  2
## 48           chaos  2
## 49          combat  2
## 50         cutting  2
## 51           delay  2
## 52            deny  2
## 53        disagree  2
## 54        disaster  2
## 55         enforce  2
## 56           harry  2
## 57        horrible  2
## 58            kick  2
## 59      misleading  2
## 60       morbidity  2
## 61           react  2
## 62        shortage  2
## 63           shout  2
## 64            sore  2
## 65           storm  2
## 66          victim  2
## 67           youth  2
## 68       abandoned  1
## 69           abuse  1
## 70         accused  1
## 71      aggression  1
## 72           armed  1
## 73        arrogant  1
## 74            bear  1
## 75           bitch  1
## 76     campaigning  1
## 77         caution  1
## 78       celebrity  1
## 79       challenge  1
## 80       communism  1
## 81     confinement  1
## 82       confusion  1
## 83        contempt  1
## 84           crazy  1
## 85        criminal  1
## 86            defy  1
## 87          demand  1
## 88         despair  1
## 89      diabolical  1
## 90        disgrace  1
## 91     disgraceful  1
## 92  disinformation  1
## 93       displaced  1
## 94      disrespect  1
## 95        epidemic  1
## 96        escalate  1
## 97            evil  1
## 98           exile  1
## 99         failing  1
## 100            fee  1
## 101          fraud  1
## 102     frustrated  1
## 103           grab  1
## 104           grim  1
## 105         guilty  1
## 106            hot  1
## 107       insanity  1
## 108         insult  1
## 109       invasion  1
## 110        limited  1
## 111         lonely  1
## 112           lose  1
## 113        lunatic  1
## 114        madness  1
## 115   manslaughter  1
## 116         misery  1
## 117        mocking  1
## 118         morals  1
## 119            mug  1
## 120        offense  1
## 121      offensive  1
## 122    persecution  1
## 123     pretending  1
## 124          prick  1
## 125         prison  1
## 126      psychosis  1
## 127          punch  1
## 128           rage  1
## 129           rail  1
## 130         rating  1
## 131     resentment  1
## 132        retract  1
## 133       ridicule  1
## 134        sarcasm  1
## 135         savage  1
## 136       sentence  1
## 137      socialist  1
## 138         strike  1
## 139       terrible  1
## 140          thief  1
## 141        traitor  1
## 142        violent  1

Lastly, let’s try “fear” coded words for Canada.

nrc_fear <- get_sentiments('nrc') %>%
  filter(sentiment == 'fear')

tidy_tweetsCA %>%
  filter(year=='y2020') %>%
  inner_join(nrc_fear) %>%
  count(word, sort=TRUE)
## Joining, by = "word"
##               word   n
## 1           stigma 559
## 2         pandemic 228
## 3            death  87
## 4          medical  77
## 5        emergency  65
## 6       government  64
## 7         hospital  52
## 8             risk  49
## 9            fight  44
## 10          change  37
## 11         disease  27
## 12            fear  26
## 13           spike  21
## 14           force  19
## 15         prevent  19
## 16           watch  18
## 17             die  16
## 18  discrimination  16
## 19           avoid  15
## 20             bad  15
## 21       infection  15
## 22      quarantine  15
## 23    surveillance  15
## 24          police  14
## 25         remains  14
## 26       challenge  13
## 27           delay  13
## 28           youth  13
## 29             god  12
## 30         illness  11
## 31          deadly  10
## 32             flu  10
## 33     confinement   9
## 34           crazy   9
## 35           dying   9
## 36      infectious   9
## 37       concerned   8
## 38       difficult   8
## 39        epidemic   8
## 40         feeling   8
## 41      inequality   8
## 42            lose   8
## 43          prison   8
## 44            warn   8
## 45           worry   8
## 46          afraid   7
## 47         caution   7
## 48       dangerous   7
## 49        disaster   7
## 50         poverty   7
## 51         warning   7
## 52           worse   7
## 53           abuse   6
## 54          forced   6
## 55           lines   6
## 56           treat   6
## 57             war   6
## 58          cancer   5
## 59           court   5
## 60          danger   5
## 61         failure   5
## 62            fire   5
## 63         hearing   5
## 64        struggle   5
## 65        surprise   5
## 66        violence   5
## 67          warden   5
## 68         abandon   4
## 69         advance   4
## 70           alarm   4
## 71          combat   4
## 72      confidence   4
## 73       contagion   4
## 74      contagious   4
## 75        disabled   4
## 76           doubt   4
## 77         enforce   4
## 78           guard   4
## 79            khan   4
## 80            loss   4
## 81        military   4
## 82          murder   4
## 83         nervous   4
## 84      retirement   4
## 85            rule   4
## 86          unsafe   4
## 87          urgent   4
## 88          warned   4
## 89         worship   4
## 90         anxiety   3
## 91            bear   3
## 92       brutality   3
## 93           burke   3
## 94      compassion   3
## 95       confusion   3
## 96      conspiracy   3
## 97           cross   3
## 98       diagnosis   3
## 99          dragon   3
## 100    examination   3
## 101     graduation   3
## 102           hate   3
## 103        illegal   3
## 104     insecurity   3
## 105         lawyer   3
## 106        missing   3
## 107        painful   3
## 108          polio   3
## 109         stroke   3
## 110      suffering   3
## 111     suspension   3
## 112       terrible   3
## 113      volunteer   3
## 114        accused   2
## 115     apparition   2
## 116        assault   2
## 117         beware   2
## 118          bitch   2
## 119          blues   2
## 120          broke   2
## 121         broken   2
## 122           cash   2
## 123          chaos   2
## 124   contaminated   2
## 125          cruel   2
## 126        crushed   2
## 127       dementia   2
## 128     destroying   2
## 129   dictatorship   2
## 130 disinformation   2
## 131     disruption   2
## 132        failing   2
## 133        fearful   2
## 134          fever   2
## 135          flood   2
## 136        fragile   2
## 137           gore   2
## 138        hanging   2
## 139        hunting   2
## 140         injury   2
## 141   intelligence   2
## 142        journey   2
## 143           kill   2
## 144        killing   2
## 145          loyal   2
## 146  manifestation   2
## 147      mortality   2
## 148           pain   2
## 149           plea   2
## 150      procedure   2
## 151       punitive   2
## 152         remove   2
## 153          shame   2
## 154          shoot   2
## 155       shooting   2
## 156           shot   2
## 157         shrink   2
## 158        suicide   2
## 159        suspect   2
## 160       threaten   2
## 161        tragedy   2
## 162     unexpected   2
## 163        unknown   2
## 164         unruly   2
## 165           wary   2
## 166         weight   2
## 167      abandoned   1
## 168       accident   1
## 169      aftermath   1
## 170      agonizing   1
## 171       alarming   1
## 172      ambulance   1
## 173        anarchy   1
## 174          armed   1
## 175         attack   1
## 176       attorney   1
## 177        autopsy   1
## 178       avoiding   1
## 179          awful   1
## 180       bacteria   1
## 181        beating   1
## 182        bigoted   1
## 183          birth   1
## 184         bloody   1
## 185    bombardment   1
## 186         brutal   1
## 187           buck   1
## 188          bully   1
## 189         burial   1
## 190          cabal   1
## 191       cautious   1
## 192       cemetery   1
## 193          cliff   1
## 194      concealed   1
## 195       conflict   1
## 196         coward   1
## 197          crash   1
## 198       criminal   1
## 199        cyclone   1
## 200           dart   1
## 201        default   1
## 202         defend   1
## 203        defense   1
## 204    deleterious   1
## 205         demise   1
## 206      dentistry   1
## 207    destination   1
## 208      destroyed   1
## 209  deterioration   1
## 210    devastating   1
## 211           dire   1
## 212      disappear   1
## 213     disastrous   1
## 214     discipline   1
## 215     discourage   1
## 216     disgusting   1
## 217      dismissal   1
## 218     distressed   1
## 219        divorce   1
## 220       dominant   1
## 221        endless   1
## 222          erase   1
## 223       exorcism   1
## 224      explosive   1
## 225       fatality   1
## 226           fled   1
## 227         flying   1
## 228      forgotten   1
## 229           gang   1
## 230          giant   1
## 231          grave   1
## 232            gun   1
## 233           harm   1
## 234           hide   1
## 235         hiding   1
## 236      holocaust   1
## 237       homeless   1
## 238         honest   1
## 239       horrible   1
## 240       horrific   1
## 241        horrors   1
## 242        hostage   1
## 243           hurt   1
## 244      immigrant   1
## 245  incarceration   1
## 246         inmate   1
## 247         insane   1
## 248       insanity   1
## 249       insecure   1
## 250           jail   1
## 251        lawsuit   1
## 252        leprosy   1
## 253           loom   1
## 254          lynch   1
## 255            mad   1
## 256        madness   1
## 257           mage   1
## 258   manipulation   1
## 259        measles   1
## 260            mob   1
## 261       mortgage   1
## 262       newcomer   1
## 263      operation   1
## 264       opponent   1
## 265        opposed   1
## 266       ordnance   1
## 267          panic   1
## 268       paranoia   1
## 269           pare   1
## 270        penalty   1
## 271         plight   1
## 272      pneumonia   1
## 273       poisoned   1
## 274       powerful   1
## 275           pray   1
## 276         punish   1
## 277      punishing   1
## 278            rat   1
## 279         ravine   1
## 280      recession   1
## 281       reckless   1
## 282     regulatory   1
## 283        rejects   1
## 284      reluctant   1
## 285         resign   1
## 286           riot   1
## 287            sag   1
## 288        scandal   1
## 289          scare   1
## 290      screaming   1
## 291       sentence   1
## 292      sickening   1
## 293            sin   1
## 294       sinister   1
## 295           slam   1
## 296          slave   1
## 297          stint   1
## 298      supremacy   1
## 299        surgery   1
## 300          sweat   1
## 301      terrorist   1
## 302          theft   1
## 303         threat   1
## 304    threatening   1
## 305          troll   1
## 306        turmoil   1
## 307      uncertain   1
## 308     unemployed   1
## 309         uphill   1
## 310       vanished   1
## 311          venom   1
## 312       vigilant   1
## 313        violent   1
## 314       worrying   1
## 315          wrath   1
## 316     xenophobia   1
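
Before moving on to bing, note that the NRC lexicon contains more categories than just joy, anger, and fear. A quick count (a side note, not part of the original analysis) lists them all along with how many words each contains.

get_sentiments('nrc') %>%
  count(sentiment, sort=TRUE)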

Next, let’s use the bing sentiment lexicon to get an idea of how sentiment changed between 2020, 2021, and 2022 in COVID tweets. We count positive and negative words within chunks of tweets (grouped via index = id %/% 80), take positive minus negative as a net sentiment score, and graph the scores on a three-faceted histogram.

GB_covid_sentiment <- tidy_tweetsGB %>%
  inner_join(get_sentiments('bing')) %>%
  count(year, index=id %/% 80, sentiment) %>%
  pivot_wider(names_from=sentiment, values_from=n, values_fill=0) %>%
  mutate(sentiment=positive-negative)
ggplot(GB_covid_sentiment, aes(sentiment, fill=year)) +
  geom_histogram(show.legend=FALSE) +
  facet_wrap(~year, ncol=2, scales='free_x')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Let’s compute the mean sentiment score for each year:

GB_covid_sentiment_means <- aggregate(x=GB_covid_sentiment$sentiment, by=list(GB_covid_sentiment$year), FUN=mean)

GB_covid_sentiment_means
##   Group.1          x
## 1   y2020 -0.5423032
## 2   y2021 -0.3636364
## 3   y2022 -0.4652062
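
The same yearly means could also be computed with dplyr if you prefer to stay in the tidyverse style; this is just an equivalent sketch of the aggregate() call above.

GB_covid_sentiment %>%
  group_by(year) %>%
  summarize(mean_sentiment=mean(sentiment))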

Now we’ll repeat everything we did with our GB data above for our US and CA data:

US_covid_sentiment <- tidy_tweetsUS %>%
  inner_join(get_sentiments('bing')) %>%
  count(year, index=id %/% 80, sentiment) %>%
  pivot_wider(names_from=sentiment, values_from=n, values_fill=0) %>%
  mutate(sentiment=positive-negative)

ggplot(US_covid_sentiment, aes(sentiment, fill=year)) +
  geom_histogram(show.legend=FALSE) +
  facet_wrap(~year, ncol=2, scales='free_x')

US_covid_sentiment_means <- aggregate(x=US_covid_sentiment$sentiment, by=list(US_covid_sentiment$year), FUN=mean)

US_covid_sentiment_means
##   Group.1          x
## 1   y2020 -0.6611008
## 2   y2021 -0.6183871
## 3   y2022 -0.5228113

CA_covid_sentiment <- tidy_tweetsCA %>%
  inner_join(get_sentiments('bing')) %>%
  count(year, index=id %/% 80, sentiment) %>%
  pivot_wider(names_from=sentiment, values_from=n, values_fill=0) %>%
  mutate(sentiment=positive-negative)

ggplot(CA_covid_sentiment, aes(sentiment, fill=year)) +
  geom_histogram(show.legend=FALSE) +
  facet_wrap(~year, ncol=2, scales='free_x')

CA_covid_sentiment_means <- aggregate(x=CA_covid_sentiment$sentiment, by=list(CA_covid_sentiment$year), FUN=mean)

CA_covid_sentiment_means
##   Group.1          x
## 1   y2020 -0.6140260
## 2   y2021 -0.6931007
## 3   y2022 -0.3765244

Using bing, we can examine which individual words contribute most to the positive and negative scores and check for misleading outliers. For example, trump is listed as a positive word in bing, but in these tweets it almost certainly refers to the person rather than to any sentiment. Likewise, positive is treated as a positive word, when here it usually pertains to a positive covid test - a decidedly negative subject. A custom stop-word list can remove these skewing words.

bing_word_counts_US <- tidy_tweetsUS %>%
  inner_join(get_sentiments('bing')) %>%
  count(word, sentiment, sort=TRUE) %>%
  ungroup()

head(bing_word_counts_US)
##       word sentiment   n
## 1 positive  positive 613
## 2    trump  positive 468
## 3    virus  negative 425
## 4    death  negative 306
## 5     died  negative 297
## 6     safe  positive 241

Here’s a graphical representation of our data:

bing_word_counts_US %>%
  group_by(sentiment) %>%
  slice_max(n, n=10) %>%
  ungroup() %>%
  mutate(word=reorder(word, n)) %>%
  ggplot(aes(n, word, fill=sentiment)) +
  geom_col(show.legend=FALSE) +
  facet_wrap(~sentiment, scales='free_y') +
  labs(x='Contribution to Sentiment', y=NULL)

Let’s now build a custom stop-word list that filters out these misleading words:

custom_stopwords <- bind_rows(tibble(word=c('trump','positive'),
                                     lexicon=c('custom')),
                              stop_words)

custom_stopwords
## # A tibble: 1,151 × 2
##    word        lexicon
##    <chr>       <chr>  
##  1 trump       custom 
##  2 positive    custom 
##  3 a           SMART  
##  4 a's         SMART  
##  5 able        SMART  
##  6 about       SMART  
##  7 above       SMART  
##  8 according   SMART  
##  9 accordingly SMART  
## 10 across      SMART  
## # … with 1,141 more rows
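
To see the effect of the custom list, the bing word counts could be recomputed with those words removed (a sketch; the word clouds below apply the same filter via anti_join).

tidy_tweetsUS %>%
  anti_join(custom_stopwords, by='word') %>%       #drop 'trump', 'positive', and standard stop words
  inner_join(get_sentiments('bing'), by='word') %>%
  count(word, sentiment, sort=TRUE) %>%
  head()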

Next, let’s look into word clouds. First, we’ll draw a simple word cloud in which larger words are more common. Then, we’ll draw a comparison cloud that splits words by bing sentiment and colors them accordingly.

library(wordcloud)

tidy_tweetsUS %>%
  anti_join(custom_stopwords) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words=75))

tidy_tweetsUS %>%
  anti_join(custom_stopwords) %>%
  inner_join(get_sentiments('bing')) %>%
  count(word, sentiment, sort=TRUE) %>%
  acast(word~sentiment, value.var='n', fill=0) %>%
  comparison.cloud(colors=c('forestgreen','magenta','cyan'), max.words=70)

Finally, let’s count how many negative words appear in each year relative to the total number of words, giving a negative-word ratio. Because the years have different total word counts, this ratio illustrates the differences in negativity between years better than raw counts would.

bingnegative <- get_sentiments('bing') %>%
  filter(sentiment=='negative')

wordcounts <- tidy_tweetsCA %>%
  group_by(year) %>%
  summarize(words=n())

tidy_tweetsCA %>%
  semi_join(bingnegative) %>%
  group_by(year) %>%
  summarize(negativewords=n()) %>%
  left_join(wordcounts, by='year') %>%
  mutate(ratio=negativewords/words) %>%
  slice_max(ratio, n=3) %>%
  ungroup()
## # A tibble: 3 × 4
##   year  negativewords words  ratio
##   <chr>         <int> <int>  <dbl>
## 1 y2020          2711 48348 0.0561
## 2 y2021          1715 31882 0.0538
## 3 y2022           753 17302 0.0435
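
The same ratio could be computed for the US and GB data by swapping in the corresponding tidy tweet set. A small helper function (a sketch, not part of the original analysis) avoids repeating the pipeline.

negative_ratio <- function(tidy_tweets) {
  wordcounts <- tidy_tweets %>%
    group_by(year) %>%
    summarize(words=n())
  
  tidy_tweets %>%
    semi_join(bingnegative, by='word') %>%
    group_by(year) %>%
    summarize(negativewords=n()) %>%
    left_join(wordcounts, by='year') %>%
    mutate(ratio=negativewords/words)
}

#e.g. negative_ratio(tidy_tweetsUS)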