In this document, I’ll go over a series of techniques for making data presentable and readable for audiences. After all, a major portion of data science is ensuring that your intended audience can understand and interpret the analysis without issue. Over the course of this document, we will cover the usage and techniques of the following:
GGPlot (ggplot2) is a very useful package that allows the graphing of all kinds of data using a number of methods and parameters to customize the presentation of your data. The above tabs will take you to vignettes demonstrating a number of useful graphing techniques.
We’ll start by making a small dataframe based on R’s built-in ToothGrowth data set. This dummy data will serve to illustrate the methods for a barplot. Note that we’re also loading ggplot2, the package we’ll be using, along with dplyr, a generally useful package for readable code and organization.
library(ggplot2)
library(dplyr)
df <- data.frame(dose = c('D0.5','D1','D2'),
len = c(4.2, 10, 29.5))
df
## dose len
## 1 D0.5 4.2
## 2 D1 10.0
## 3 D2 29.5
Next, let’s make a second dataframe. In this one, we’re including a third variable: supplement. VC is vitamin C, and OJ is orange juice.
df2 <- data.frame(supp = rep(c('VC','OJ'), each = 3),
dose = rep(c('D0.5','D1','D2'), 2),
len = c(6.8,15,33,4.2,10,29.5))
df2
## supp dose len
## 1 VC D0.5 6.8
## 2 VC D1 15.0
## 3 VC D2 33.0
## 4 OJ D0.5 4.2
## 5 OJ D1 10.0
## 6 OJ D2 29.5
Here, we’ll set the appropriate parameters for ggplot’s visuals.
theme_set(
theme_classic() +
theme(legend.position = 'top')
)
We’ll start with some basic barplots using the tooth data we’ve set up. We’ll create a variable f, which will be our ggplot object, and then add the geom_col() layer to plot it.
f <- ggplot(df, aes(x=dose,y=len))
f + geom_col()
Now let’s try changing the fill and adding labels to the top.
f + geom_col(fill='darkblue') +
geom_text(aes(label=len), vjust=-0.3)
Now let’s add the labels inside the bars.
f + geom_col(fill='darkblue') +
geom_text(aes(label=len),vjust=1.6, color='white')
Next, we’ll change the barplot colors by group.
f + geom_col(aes(color=dose), fill='white') +
scale_color_manual(values=c('blue','gold','red'))
This isn’t very readable, since only the bar outlines are colored, so let’s map the groups to the fill color instead.
f + geom_col(aes(fill=dose)) +
scale_fill_manual(values=c('blue','gold','red'))
Suppose we have a dataset like df2, which has two treatment groups. How do we graph that in a barplot? We can map color and fill to supp to distinguish the two groups, then use position=position_stack() to place the groups atop one another in a stacked barplot.
ggplot(df2, aes(x=dose,y=len)) +
geom_col(aes(color=supp, fill=supp), position=position_stack()) +
scale_color_manual(values=c('blue','gold')) +
scale_fill_manual(values=c('blue','gold'))
What if we want the bars next to each other instead of stacked? Note that we also can save our graphs into a unique variable - for example, p - then call the variable to print the formatted plot.
p <- ggplot(df2, aes(x=dose,y=len)) +
geom_col(aes(color=supp,fill=supp), position=position_dodge(0.8), width=0.7) +
scale_color_manual(values=c('blue','gold')) +
scale_fill_manual(values=c('blue','gold'))
p
Let’s now add labels to the dodged barplot. Note that position_dodge has been slightly adjusted, moving the labels slightly off center.
p + geom_text(
aes(label=len,group=supp),
position=position_dodge(0.85),
vjust=-0.3,size=3.5
)
Alternatively, what if we want labels for our stacked barplots? To do this, we’ll need to use dplyr. Below, we use dplyr to add another column that stores label position values. Note that the %>% symbol is the pipe operator (re-exported by dplyr from magrittr): it passes the result of one function on to the next.
df2 <- df2 %>%
group_by(dose) %>%
arrange(dose, desc(supp)) %>%
mutate(lab_ypos=cumsum(len)-0.5*len)
We’ll now recreate our stacked graphs. We’ll also supply a vector of text colors so that the labels on both the gold and blue bars remain readable.
ggplot(df2, aes(x=dose,y=len)) +
geom_col(aes(fill=supp), width=0.7) +
geom_text(aes(y=lab_ypos,label=len,group=supp),color=c('black','white','black','white','black','white')) +
scale_color_manual(values=c('blue','gold')) +
scale_fill_manual(values=c('blue','gold'))
Let’s now look at boxplots. This time, we’re going to use the full ToothGrowth dataset built into R.
library(ggplot2)
data('ToothGrowth')
Let’s change the dose to a factor, then look at the top of the dataframe.
ToothGrowth$dose <- as.factor(ToothGrowth$dose)
head(ToothGrowth, 4)
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
Let’s set the theme for our plots.
theme_set(
theme_dark() +
theme(legend.position='top')
)
And now we’ll begin with a very basic boxplot of dose vs. length.
tg <- ggplot(ToothGrowth, aes(x=dose,y=len))
tg + geom_boxplot()
Next, we’ll add points for the means of the data and try out a notched boxplot.
tg + geom_boxplot(notch=TRUE, fill='lightgray') +
stat_summary(fun=mean, geom='point', shape=18, size=2.5, color='indianred')
It’s also possible to adjust which levels appear on the x axis, and their order. Below, we’ll drop the data for dose=1.
tg + geom_boxplot() +
scale_x_discrete(limits=c('0.5','2'))
## Warning: Removed 20 rows containing missing values (stat_boxplot).
Next, we will adjust the scale to instead go in descending order.
tg + geom_boxplot() +
scale_x_discrete(limits=c('2','1','0.5'))
Boxplot colors can also be adjusted by group.
tg + geom_boxplot(aes(color=dose)) +
scale_color_manual(values=c('indianred','blue1','green2'))
What if we want to display our data subset by OJ vs. Vitamin C?
tg2 <- tg + geom_boxplot(aes(fill=supp), position=position_dodge(0.8)) +
scale_fill_manual(values=c('#E69F00','#999999'))
tg2
If you want the two groups in separate panels instead of adjacent boxes, you can use facet_wrap. We’ll go over facets more in-depth in the Other Plotting Methods section below.
tg2 + facet_wrap(~supp)
Before moving forward, we’re going to set a seed so that our results are reproducible; we’ll be setting up a dataframe of randomly generated normal data, hence the need for a seed.
library(dplyr)
library(ggplot2)
set.seed(1234)
wdata = data.frame(
sex=factor(rep(c('F','M'), each=200)),
weight=c(rnorm(200,56), rnorm(200,58))
)
head(wdata,4)
## sex weight
## 1 F 54.79293
## 2 F 56.27743
## 3 F 57.08444
## 4 F 53.65430
Next, we’ll need summary statistics in order to graph our data. We’ll use dplyr to calculate these.
mu <- wdata %>%
group_by(sex) %>%
summarise(grp.mean=mean(weight))
Now, we’ll set our theme for ggplot.
theme_set(
theme_dark() +
theme(legend.position='bottom')
)
Next, we’ll create a ggplot object using our dataframe.
a <- ggplot(wdata, aes(x=weight))
a + geom_histogram(bins=30,color='black',fill='grey') +
geom_vline(aes(xintercept=mean(weight)),
linetype='dashed', size=0.6)
Next, let’s change the colors so each group is separately identifiable.
a + geom_histogram(aes(color=sex), fill='white', position='identity') +
scale_color_manual(values=c('#00AFBB','#B7B800'))
Now let’s add a fill to the bars.
a + geom_histogram(aes(color=sex, fill=sex), position='identity') +
scale_color_manual(values=c('#00AFBB','#B7B800')) +
scale_fill_manual(values=c('indianred','steelblue'))
What if we want to combine density plots and histograms? Combining plots with GGPlot is as easy as using a ‘+’ sign, actually.
a + geom_histogram(aes(y=stat(density)),
color='black', fill='white') +
geom_density(alpha=0.2, fill='#FF6666')
We can also split by the sex variable to plot two density curves along with our histogram.
a + geom_histogram(aes(y=stat(density), color=sex), fill='white', position='identity') +
geom_density(aes(color=sex),size=1) +
scale_color_manual(values=c('indianred','steelblue'))
Lineplots are one of the most ubiquitous and intuitive graph types when used in their intended way. To demonstrate how to make them, we’ll create a simple dataframe like we did for the barplots section:
library(ggplot2)
df <- data.frame(dose=c('D0.5','D1','D2'),
len=c(4.2,10,29.5))
df
## dose len
## 1 D0.5 4.2
## 2 D1 10.0
## 3 D2 29.5
Next we’ll create a second dataframe for plotting by groups.
df2 <- data.frame(supp=rep(c('VC','OJ'),each=3),
dose=rep(c('D0.5','D1','D2'), 2),
len=c(6.8,15,33,4.2,10,29.5))
df2
## supp dose len
## 1 VC D0.5 6.8
## 2 VC D1 15.0
## 3 VC D2 33.0
## 4 OJ D0.5 4.2
## 5 OJ D1 10.0
## 6 OJ D2 29.5
Now, as usual, we’ll set up GGPlot’s theme.
theme_set(
theme_gray() +
theme(legend.position='right')
)
Now let’s get to the actual line plots, starting with basic line plots. First we will build a function to display all the different line types.
generateRLineTypes <- function(){
oldPar <- par()
par(font=2, mar=c(0,0,0,0))
plot(1, pch='',ylim=c(0,6),xlim=c(0,0.7), axes=FALSE, xlab='', ylab='')
for(i in 0:6) lines(c(0.3,0.7), c(i,i), lty=i, lwd=3)
text(rep(0.1,7),0:6, labels=c("0.'blank'", "1.'solid'", "2.'dashed'", "3.'dotted'", "4.'dotdash'", "5.'longdash'", "6.'twodash'"))
par(mar=oldPar$mar, font=oldPar$font)
}
generateRLineTypes()
Now we’ll build our basic line plot.
p <- ggplot(data=df, aes(x=dose, y=len, group=1))
p + geom_line() + geom_point()
Now let’s try modifying the line type and colors.
p + geom_line(linetype='dashed', color='steelblue') +
geom_point(color='indianred')
Next we’ll try a step graph, which indicates a threshold-type progression.
p + geom_step() + geom_point()
Now let’s move on to making multiple groups, starting with creating a new ggplot object.
p2 <- ggplot(df2, aes(x=dose, y=len, group=supp))
Now we’ll change the line types and point shapes by group.
p2 + geom_line(aes(linetype=supp, color=supp)) +
geom_point(aes(shape=supp, color=supp)) +
scale_color_manual(values=c('darkred','darkblue'))
What about line plots with a numeric x axis? Let’s make a new dataframe with dose stored as character strings so we can convert it to numeric.
df3 <- data.frame(supp=rep(c('VC','OJ'), each=3),
dose=rep(c('0.5','1','2'), 2),
len=c(6.8,15,33,4.2,10,29.5))
df3
## supp dose len
## 1 VC 0.5 6.8
## 2 VC 1 15.0
## 3 VC 2 33.0
## 4 OJ 0.5 4.2
## 5 OJ 1 10.0
## 6 OJ 2 29.5
Now we’ll convert dose to numeric and plot it so that the x axis is treated as continuous.
df3$dose <- as.numeric(as.vector(df3$dose))
ggplot(data=df3, aes(x=dose, y=len, group=supp, color=supp)) +
geom_line() + geom_point()
Next we’ll look at a line graph with the x-axis as dates. We’ll use the built-in economics time series for this example:
head(economics)
## # A tibble: 6 × 6
## date pce pop psavert uempmed unemploy
## <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1967-07-01 507. 198712 12.6 4.5 2944
## 2 1967-08-01 510. 198911 12.6 4.7 2945
## 3 1967-09-01 516. 199113 11.9 4.6 2958
## 4 1967-10-01 512. 199311 12.9 4.9 3143
## 5 1967-11-01 517. 199498 12.8 4.7 3066
## 6 1967-12-01 525. 199657 11.8 4.8 3018
And here’s the plot itself:
ggplot(data=economics, aes(x=date, y=pop)) +
geom_line()
Let’s make a subset of the data containing only dates after January 1, 2006.
ss <- subset(economics, date > as.Date('2006-1-1'))
ggplot(data=ss, aes(x=date, y=pop)) + geom_line()
We can also change the line size contextually based on a secondary variable, such as unemployment.
ggplot(data=economics, aes(x=date, y=pop)) +
geom_line(aes(size=unemploy/pop))
We can also plot multiple time-series data.
ggplot(economics, aes(x=date)) +
geom_line(aes(y=psavert), color='darkred') +
geom_line(aes(y=uempmed), color='steelblue', linetype='twodash')
Lastly, let’s make this into an area plot.
ggplot(economics, aes(x=date)) +
geom_area(aes(y=psavert), fill='#999999',
color = '#999999', alpha = 0.5) +
geom_area(aes(y=uempmed), fill='#E69F00',
color = '#E69F00', alpha = 0.5)
First, we’ll load up our required packages as usual and set our GGPlot theme.
library(ggplot2)
theme_set(
theme_light() +
theme(legend.position='bottom')
)
Now we’ll initiate a ggplot object called tg using the ToothGrowth dataset. Note that we begin by importing our data, then setting dose to be a factor.
data('ToothGrowth')
ToothGrowth$dose <- as.factor(ToothGrowth$dose)
tg <- ggplot(ToothGrowth, aes(x=dose,y=len))
We’ll now create a dotplot with a summary statistic.
tg + geom_dotplot(binaxis='y', stackdir='center', fill='lightgray') +
stat_summary(fun=mean, fun.args=list(mult=1))
Let’s plot both a dotplot and boxplot together.
tg + geom_boxplot(width=0.5) +
geom_dotplot(binaxis='y', stackdir='center',fill='lightgray')
Let’s now add a violin plot to our dotplot, which helps visualize the shape of the distribution.
tg + geom_violin(trim=FALSE) +
geom_dotplot(binaxis='y', stackdir='center', fill='#999999') +
stat_summary(fun=mean, fun.args=list(mult=1))
Now let’s try creating a dotplot with multiple groups specified by color.
tg + geom_boxplot(width=0.5) +
geom_dotplot(aes(fill=supp),binaxis='y', stackdir='center') +
scale_fill_manual(values=c('indianred','lightblue1'))
And finally, just as we could split the bar graphs into separate bars by group, we can dodge the boxplot and dotplot by supplement so the two groups appear side by side.
tg + geom_boxplot(aes(color=supp), width=0.5, position=position_dodge(0.8)) +
geom_dotplot(aes(fill=supp, color=supp), binaxis='y', stackdir='center', dotsize=0.8, position=position_dodge(0.8)) +
scale_fill_manual(values=c('orange','green')) +
scale_color_manual(values=c('orange','green'))
Density plots serve as good alternatives to histograms. First, we’ll establish our dataset from random normal numbers placed into a dataframe.
library(dplyr)
library(ggplot2)
set.seed(376)
wdata = data.frame(
sex=factor(rep(c('F','M'), each=200)),
weight=c(rnorm(200,55),rnorm(200,58))
)
Next, we’ll generate summary statistics for plotting.
mu <- wdata %>%
group_by(sex) %>%
summarise(grp.mean=mean(weight))
Now let’s set our theme for ggplot.
theme_set(
theme_classic() +
theme(legend.position='left')
)
Next we’ll do the basic plotting function, including creating a ggplot object.
d <- ggplot(wdata, aes(x=weight))
d + geom_density() +
geom_vline(aes(xintercept=mean(weight)), linetype='dashed')
Now we’ll change the y axis to counts instead of density.
d + geom_density(aes(y=stat(count)), fill = 'steelblue', alpha=0.5) +
geom_vline(aes(xintercept=mean(weight)), linetype='dashed')
We can also split the single line between our sex variable.
d + geom_density(aes(color=sex)) +
scale_color_manual(values=c('red','blue'))
Now let’s add a fill on the density plots, as well as lines to mark the means of both groups.
d + geom_density(aes(fill=sex), alpha = 0.5) +
geom_vline(aes(xintercept=grp.mean, color=sex), data = mu, linetype='dashed') +
scale_color_manual(values=c('red','blue')) +
scale_fill_manual(values=c('red','blue'))
First, we’ll load our required packages below. Note that we’re loading ggridges specifically for this, as well as viridis. Viridis expands our color options and interfaces well with ggplot2 and ggridges.
library(ggplot2)
library(ggridges)
library(viridis)
library(tidyr)
Next, let’s do a simple ridge plot for density using temperature by month from the airquality dataset built into R.
air <- ggplot(airquality) + aes(Temp, Month, group=Month) + geom_density_ridges()
air
Let’s add some pizazz now. Viridis will be used in the fill and scale_fill_viridis layers below.
ggplot(airquality) + aes(Temp, Month, group=Month, fill=..x..) +
geom_density_ridges_gradient() +
scale_fill_viridis(option='C', name='Temp')
Viridis’ color gradients and scale give us much more readable data. Ridge plotting may not be the best for this data, but it serves as a good template for the sake of example.
The last thing we’ll do is create a facet plot for our data.
airquality %>%
gather(key='Measurement', value='value', Ozone, Solar.R, Wind, Temp) %>%
ggplot() + aes(value, Month, group=Month) + #Here, we're saying we want value & month as our axes, then grouping the data by month
geom_density_ridges() +
facet_wrap(~ Measurement, scales = 'free')
ECDF stands for Empirical Cumulative Distribution Function. This method is a useful alternative to a histogram, as it plots the full range of a dataset without the need for extra parameters.
First things first, we’ll compute an empirical cumulative distribution function. For any given value, the ECDF reports the proportion of observations at or below that value. First, we’ll create a dummy dataframe from randomized data:
set.seed(1234)
wdata=data.frame(
sex=factor(rep(c('F','M'), each=200)),
weight=c(rnorm(200,55), rnorm(200,58))
)
head(wdata, 5)
## sex weight
## 1 F 53.79293
## 2 F 55.27743
## 3 F 56.08444
## 4 F 52.65430
## 5 F 55.42912
Next, we’ll load our plotting package and set up our theme.
library(ggplot2)
theme_set(
theme_gray() +
theme(legend.position='bottom')
)
Now we’ll create our ECDF plot.
ggplot(wdata,
aes(x=weight)) +
stat_ecdf(aes(color=sex, linetype=sex),
geom='step', size=1.5) +
scale_color_manual(values=c('magenta','forestgreen')) +
labs(y='ECDF')
And now we’ll take a look at QQ plots. These are used to assess whether data follow a theoretical distribution, most commonly the normal distribution.
First, we’ll load the required packages and set our seed, then randomly generate our dummy data.
library(ggpubr)
library(ggplot2)
library(mnonr)
set.seed(1234)
wdata=data.frame(
sex=factor(rep(c('F','M'), each=200)),
weight=c(rnorm(200,55), rnorm(200,58))
)
head(wdata, 5)
## sex weight
## 1 F 53.79293
## 2 F 55.27743
## 3 F 56.08444
## 4 F 52.65430
## 5 F 55.42912
theme_set(
theme_dark() +
theme(legend.position='top')
)
Now we’ll create the QQ plot itself.
ggplot(wdata,
aes(sample=weight)) +
stat_qq(aes(color=sex)) +
scale_color_manual(values=c('#0073C2FF','#FC4E07')) +
labs(y='weight')
We can also add a confidence band to make the trend in the data more apparent.
ggqqplot(wdata, x='weight',
color='sex',
palette=c('#0073C2FF','#FC4E07'),
ggtheme=theme_pubclean())
What would non-normal data look like on a QQ plot? To demonstrate this, we’ll use the mnonr package to create some random data.
data2 <- mnonr::mnonr(n=1000, p=2, ms=3, mk=61, Sigma=matrix(c(1, 0.5, 0.5, 1), 2, 2), initial=NULL)
data2 <- as.data.frame(data2)
Next, we’ll plot that non-normal data.
ggplot(data2,
aes(sample=V1)) +
stat_qq()
And here’s that same data using the shading we demonstrated earlier.
ggqqplot(data2, x='V1',
palette='forestgreen',
ggtheme=theme_pubclean())
GGPlot isn’t the only graphing package available in R, and Plotly has the unique ability to produce interactive plots. In addition, these interactive plots don’t even need to be 2D; the second section will demonstrate how to create an interactive 3D plot.
First let’s load our required package and use a built-in data set of orange tree ages and their circumference.
library(plotly)
Orange <- as.data.frame(Orange)
Next, let’s create a basic scatter plot of our data.
plot_ly(data = Orange, x=~age, y=~circumference)
If you mouse-over the plot above, you can get the exact values of each point. We can do more, though; let’s add some more info. We’ll have each color represent a different tree ID, and scale each point by age. Additionally, each point can be hovered over for detailed information.
plot_ly(data = Orange, x=~age, y=~circumference,
color = ~Tree, size = ~age,
text = ~paste('Tree ID:', Tree, '<br>Age:', age, 'Circ:', circumference)
)
Next, let’s create a random distribution and add it to our dataframe.
trace_1 <- rnorm(35, mean=120, sd=10)
new_data <- data.frame(Orange, trace_1)
We’ll use the random numbers as lines on the graph to demonstrate how Plotly can be used to display multiple different variables and graph types.
plot_ly(data=new_data, x=~age, y=~circumference, color=~Tree, size=~age, text=~paste('Tree ID:', Tree, '<br>Age:', age, 'Circ:', circumference)) %>%
add_trace(y=~trace_1, mode='lines') %>%
add_trace(y=~circumference, mode='markers')
Now let’s create a graph with a dropdown menu that switches the trace between scatter and line modes, and add a title. Note that we use new_data here so that the trace_1 column is available.
plot_ly(data=new_data, x=~age, y=~circumference, color=~Tree, size=~circumference, text=~paste('Tree ID:', Tree, '<br>Age:', age, 'Circ:', circumference)) %>%
add_trace(y=~trace_1, mode='lines') %>%
add_trace(y=~circumference, mode='markers') %>%
layout(
title='Plot of Orange Data with Switchable Trace',
updatemenus = list(
list(
type='dropdown',
y=0.8,
buttons=list(
list(method='restyle',
args=list('mode', 'markers'),
label='Marker'
),
list(method='restyle',
args=list('mode','lines'),
label='Lines'
)
)
)
)
)
First, we’ll load our required package, build a grid of x and y values, and create a random z matrix for the surface.
library(plotly)
d <- data.frame(
x = seq(1,10,by=0.5),
y = seq(1,10,by=0.5)
)
z <- matrix(rnorm(length(d$x)*length(d$y)), nrow=length(d$x), ncol=length(d$y))
Now let’s plot our 3D data.
plot_ly(d, x=~x, y=~y, z=~z) %>%
add_surface()
You can click and drag the above plot to spin the 3D plot, allowing for excellent interactivity with the user. Next, let’s add some topographical lines to our plot to add yet more information.
plot_ly(d, x=~x, y=~y, z=~z) %>%
add_surface(contours=list(z=list(show=TRUE, usecolormap=TRUE, highlightcolor='indianred', project=list(z=TRUE)
)
)
)
Now let’s try out a 3D scatter plot using Longley’s Economic Regression Data.
plot_ly(longley, x=~GNP, y=~Population, z=~Employed, marker=list(color=~GNP)) %>%
add_markers()
Aside from GGPlot and Plotly, there are more useful packages for niche applications, and there are also a number of general practices that would apply to any graph. Error bars are important for an array of various plots and datasets for appropriate presentation, facet plots can separate data for clarity, and heatmaps are frequently found in genetics research. Each of these is outlined in the above tabs.
As always, we’ll begin with loading our required libraries and setting up our dataset. In this case, we’ll use the built-in ToothGrowth dataset.
library(ggplot2)
library(dplyr)
library(plotrix)
theme_set(
theme_dark() +
theme(legend.position = 'bottom')
)
df <- ToothGrowth
df$dose <- as.factor(df$dose)
Now we’ll use dplyr to summarise our data: the standard deviation, standard error, and mean of len for each dose.
df.summary <- df %>%
group_by(dose) %>%
summarise(
sd = sd(len, na.rm=TRUE),
stderr = std.error(len, na.rm=TRUE),
len = mean(len)
)
df.summary
## # A tibble: 3 × 4
## dose sd stderr len
## <fct> <dbl> <dbl> <dbl>
## 1 0.5 4.50 1.01 10.6
## 2 1 4.42 0.987 19.7
## 3 2 3.77 0.844 26.1
Now let’s look at some key functions for drawing error bars: geom_pointrange(), geom_errorbar(), and geom_errorbarh(). We’ll demonstrate these below, but first we must create our ggplot object.
tg <- ggplot(
df.summary,
aes(x=dose,y=len,ymin=len-sd, ymax=len+sd)
)
First, let’s test out basic error bars.
tg + geom_pointrange()
tg + geom_errorbar(width=0.2) +
geom_point(size=1.5)
Now we’ll create some horizontal error bars by swapping the x and y mappings.
ggplot(df.summary,
aes(x=len, y=dose, xmin=len-sd, xmax=len+sd)) +
geom_point() +
geom_errorbarh(height=0.2)
This gives us a good idea of how error bars can be placed along the horizontal axis.
Next, we’ll look at adding jitter points (actual measurements) to our data.
ggplot(df,
aes(dose,len)) +
geom_jitter(position=position_jitter(0.2), color='white') +
geom_pointrange(aes(ymin=len-sd, ymax=len+sd), data=df.summary)
Next, let’s try placing error bars on a violin plot.
ggplot(df,
aes(dose,len)) +
geom_violin(color='white', trim=FALSE) +
geom_pointrange(aes(ymin=len-sd, ymax=len+sd), data=df.summary) +
geom_errorbar(aes(ymin=len-sd, ymax=len+sd), data=df.summary, width=0.2)
How about line graphs? Error bars can be added to those like so:
ggplot(df.summary,
aes(dose,len)) +
geom_line(aes(group=1)) + #always specify this when you have 1 line
geom_errorbar(aes(ymin=len-stderr, ymax=len+stderr), width=0.2) +
geom_point(size=2)
Now we’ll make a bar graph with half error bars.
ggplot(df.summary,
aes(dose,len)) +
geom_col(fill='white', color='black') +
geom_errorbar(aes(ymin=len, ymax=len+stderr), width=0.2)
Alternatively, with full error bars;
ggplot(df.summary,
aes(dose,len)) +
geom_col(fill='white', color='black') +
geom_errorbar(aes(ymin=len-stderr, ymax=len+stderr), width=0.2)
As you can see, the main difference is how we specified ymin and ymax in geom_errorbar; by setting ymin to len rather than len-stderr, the lower limit coincides with the top of each bar, so only the upper half of each error bar is visible.
Let’s next try adding jitter points to some line plots. To do so, we’ll need to use the original dataframe for the jitter plot, and the summary df for the geom layers.
ggplot(df,
aes(dose,len)) +
geom_jitter(position=position_jitter(0.2), color='white') +
geom_line(aes(group=1), data=df.summary) +
geom_errorbar(aes(ymin=len-stderr, ymax=len+stderr), data=df.summary, width=0.2) +
geom_point(data=df.summary, size=2)
We can even add jitter points to a barplot.
ggplot(df,
aes(dose,len)) +
geom_col(data=df.summary, fill='steelblue', color='black') +
geom_jitter(position=position_jitter(0.2), color='white') +
geom_errorbar(aes(ymin=len-stderr, ymax=len+stderr),
data=df.summary, width=0.2) +
geom_point(data=df.summary, size=2)
What if we wanted separate error bars for different groups of our data? (OJ vs. VC) We’ll start by making a new summary object that groups our data by dose and supplement.
df.summary2 <- df %>%
group_by(dose, supp) %>%
summarise(
sd=sd(len),
stderr=std.error(len),
len=mean(len)
)
## `summarise()` has grouped output by 'dose'. You can override using the
## `.groups` argument.
df.summary2
## # A tibble: 6 × 5
## # Groups: dose [3]
## dose supp sd stderr len
## <fct> <fct> <dbl> <dbl> <dbl>
## 1 0.5 OJ 4.46 1.41 13.2
## 2 0.5 VC 2.75 0.869 7.98
## 3 1 OJ 3.91 1.24 22.7
## 4 1 VC 2.52 0.795 16.8
## 5 2 OJ 2.66 0.840 26.1
## 6 2 VC 4.80 1.52 26.1
Next, we’ll plot our data.
ggplot(df.summary2,
aes(dose,len)) +
geom_pointrange(
aes(ymin=len-stderr, ymax=len+stderr, color=supp), position=position_dodge(0.3)) +
scale_color_manual(values=c('indianred','lightblue')) +
geom_errorbar(
aes(ymin=len-stderr, ymax=len+stderr, color=supp, group=supp),
width=0.2, position=position_dodge(0.3))
Let’s try a line plot with multiple error bars.
ggplot(df.summary2,
aes(dose,len)) +
geom_line(aes(linetype=supp, group=supp)) +
geom_point() +
geom_errorbar(aes(ymin=len-stderr, ymax=len+stderr, group=supp), width=0.2)
And the same with a bar plot:
ggplot(df.summary2,
aes(dose,len)) +
geom_col(aes(fill=supp), position=position_dodge(0.8), width=0.7) +
geom_errorbar(
aes(ymin=len-stderr, ymax=len+stderr, group=supp),
width=0.2, position=position_dodge(0.8)) +
scale_fill_manual(values=c('indianred','lightblue'))
Next, let’s try adding some jitter points to a line graph.
ggplot(df,
aes(dose, len, color=supp)) +
geom_jitter(position=position_dodge(0.2)) +
geom_line(aes(group=supp), data=df.summary2) +
geom_point() +
geom_errorbar(aes(ymin=len-stderr, ymax=len+stderr, group=supp), data=df.summary2, width=0.2)
And finally, we can also add jitter points to a bar graph to further illustrate the shape of our data.
ggplot(df,
aes(dose, len, color=supp)) +
geom_col(data=df.summary2, position=position_dodge(0.8), width=0.7, fill='black') +
geom_jitter(position=position_jitterdodge(jitter.width=0.2, dodge.width=0.8)) +
geom_errorbar(aes(ymin=len-stderr, ymax=len+stderr), data=df.summary2, width=0.2, position=position_dodge(0.8)) +
scale_color_manual(values=c('gold','white')) +
theme(legend.position='top')
Facet plots can be used for a number of purposes. We’ll start by splitting a single dataset into panels, then show how to combine several plots (even from different datasets) into one figure.
library(ggpubr)
library(ggplot2)
theme_set(
theme_dark() +
theme(legend.position='top')
)
First, we’ll make a nice boxplot.
df <- ToothGrowth
df$dose <- as.factor(df$dose)
p <- ggplot(df,
aes(x=dose,y=len)) +
geom_boxplot(aes(fill=supp), position=position_dodge(0.9)) +
scale_fill_manual(values=c('forestgreen','blue'))
p
Now let’s take a look at ggplot’s facet_grid() function.
p +
facet_grid(rows=vars(supp))
Next, let’s use facets on data with multiple variables.
p +
facet_grid(rows=vars(dose), cols=vars(supp))
Let’s take a quick look at the facet_wrap function, which can place facets side-by-side.
p +
facet_wrap(vars(dose), ncol=2)
Now how would we go about combining multiple plots? We can use ggarrange(). We’ll start by making some basic plots.
my3cols <- c('forestgreen','blue','magenta') #This defines our color palette
ToothGrowth$dose <- as.factor(ToothGrowth$dose)
p2 <- ggplot(ToothGrowth,
aes(x=dose, y=len))
bxp <- p2 +
geom_boxplot(aes(color=dose)) +
scale_color_manual(values=my3cols)
Now we’ll do a dotplot.
dp <- p2 + geom_dotplot(aes(color=dose, fill=dose),
binaxis='y', stackdir='center') +
scale_color_manual(values=my3cols) +
scale_fill_manual(values=my3cols)
And lastly, we’ll create a line plot.
lp <- ggplot(economics, aes(x=date, y=psavert)) + #we're mix-and-matching datasets just for visual comparison's sake
geom_line(color='indianred')
Finally, we can put the whole figure together.
figure <- ggarrange(bxp, dp, lp, labels=c('A','B','C'), ncol=2, nrow=2)
figure
This is good, but it can look better. Let’s rearrange so that our line plot is placed on top and stretched the full width of our figure.
figure2 <- ggarrange(
lp,
ggarrange(bxp, dp, ncol=2, labels=c('B','C')),
nrow=2,
labels='A')
figure2
This looks pretty good, but B and C have redundant legends. Let’s fix that by giving them a shared legend.
figure3 <- ggarrange(
lp,
ggarrange(bxp, dp, ncol=2, labels=c('B','C'), common.legend=TRUE, legend='bottom'),
nrow=2,
labels='A')
figure3
We should export our plot.
ggexport(figure3, filename='facetfigure.pdf')
You can also export multiple plots to a pdf.
ggexport(bxp, dp, lp, filename='multi.pdf')
Lastly, we can export to pdf with multiple pages and multiple columns.
ggexport(bxp, dp, lp, p, filename='test2.pdf', nrow=2, ncol=1)
Let’s get started with heatmaps. We’ll load our package, then get our data together. Our dataset is a time series, so we’ll have to transform it into a matrix, then re-add our labels.
library(heatmap3)
data <- ldeaths
data2 <- do.call(cbind, split(data, cycle(data))) #This transforms the data into a dataframe-like structure, but we lose our labels
dimnames(data2) <- dimnames(.preformat.ts(data)) #This adds the labels back to our data
Now we’ll generate a basic heatmap.
heatmap(data2)
We can also remove the clustering from the previous version, ordering the rows and columns by their labels instead.
heatmap(data2, Rowv=NA, Colv=NA)
Next, we’ll mess around with the colors. We’ll use the rainbow function to generate a palette.
rc <- rainbow(nrow(data2), start=0, end=0.3)
cc <- rainbow(ncol(data2), start=0, end=0.3)
Now we’ll apply our color selection to the heatmap; ColSideColors adds an annotation bar of these colors above the columns.
heatmap(data2, ColSideColors=cc)
We can also use the package RColorBrewer to access more options for our heatmap’s colors.
library(RColorBrewer)
heatmap(data2, ColSideColors=cc,
col = colorRampPalette(brewer.pal(8, 'PiYG'))(25))
There are more aspects that we can customize. heatmap.2 adds a color key in the top left, with a small histogram showing how many values fall into each color bin. This comes from the gplots package, not to be confused with ggplot2.
library(gplots)
heatmap.2(data2, ColSideColors=cc,
col=colorRampPalette(brewer.pal(8, 'PiYG'))(25))
Outliers can be very harmful to any data analysis, and for more reasons than simply skewing results. In R and many of its packages, missing values can cause errors or even make functions fail entirely. Addressing missing values, finding outliers, and analyzing covariation are important in any solid data analysis.
If you encounter unusual values in your data and want to move on to the rest of your analysis, you have two options: drop the affected rows entirely, or replace the unusual values with missing values (NA).
Below, we’ll try out method 1 and filter the diamonds dataset by removing every value of y unless it is between 3 and 20. I also added filters for x and z to remove similar outliers in diamonds2.5.
library(dplyr)
library(ggplot2)
diamonds <- diamonds
diamonds2 <- diamonds %>%
filter(between(y,3,20))
diamonds2.5 <- diamonds %>%
filter(between(y,3,20), between(x,3,11), between(z,2,6))
This way of filtering data isn’t always preferable, as one bad measurement doesn’t necessarily mean we should toss the entire row; its other measurements may still be useful.
Instead, you can use the second method: replacing unusual values with missing values (NA).
Below, we’ll mutate the diamonds dataset to change the y values we filtered before (below 3 or above 20) into NA values. diamonds3.5 does the same for the x and z values.
diamonds3 <- diamonds %>%
mutate(y = ifelse(y < 3 | y > 20, NA, y))
diamonds3.5 <- diamonds %>%
mutate(y = ifelse(y < 3 | y > 20, NA, y), x = ifelse(x < 3 | x > 11, NA, x), z = ifelse(z < 2 | z > 6, NA, z))
Like base R, ggplot2 doesn’t silently discard missing values; it keeps track of them and warns you when they are dropped from a plot.
ggplot(data = diamonds3, mapping = aes(x=x, y=y)) +
geom_point()
## Warning: Removed 9 rows containing missing values (geom_point).
As you can see, ggplot alerts you that it has removed rows containing missing values.
To suppress that warning, you can use na.rm = TRUE, as demonstrated below.
ggplot(data = diamonds3, mapping= aes(x=x,y=y)) +
geom_point(na.rm = TRUE)
Sometimes, we want to understand what makes observations with missing values different from recorded values; perhaps recorded values could allow you to infer why data is missing.
For example, in the nycflights13 dataset, missing values in departure time (dep_time) indicate a cancelled flight.
You may want to compare scheduled departure times for cancelled and non-cancelled flights:
library(nycflights13)
nycflights13::flights %>%
mutate(
cancelled = is.na(dep_time),
sched_hour = sched_dep_time %/% 100,
sched_min = sched_dep_time %% 100,
sched_dep_time = sched_hour + sched_min / 60
) %>%
ggplot(mapping=aes(sched_dep_time)) +
geom_freqpoly(mapping=aes(color=cancelled), binwidth=1/4)
Let’s say we want to find outliers from a dataset. This section will focus on that, using a dataset on air quality that I stored locally.
We’ll start out like always, loading our required libraries and dataset:
library(outliers)
library(ggplot2)
library(readxl)
Air_data <- read_excel('~/School/Fall 2022/Bioinf/R Files/AirQualityUCI.xlsx')
Let’s create a function using the Grubbs test to identify all outliers. The Grubbs test identifies outliers in a univariate dataset that is presumed to come from a normal distribution. Below, I’ve commented in-line with the code, as each section is worth explaining.
grubbs.flag <- function(x) {
#We'll start by creating a variable called outliers and save nothing in it; we'll add to it as we identify outliers
outliers <- NULL
#We'll also create a variable called test to identify which univariate we're testing
test <- x
#Now, using the outliers package, we'll use grubbs.test to find outliers in our variable
grubbs.result <- grubbs.test(test)
#Get the p-value of the test
pv <- grubbs.result$p.value
#While the test is significant (p < 0.05), flag the most extreme value as an outlier
while(pv<0.05) {outliers <- c(outliers, as.numeric(strsplit(grubbs.result$alternative, ' ')[[1]][3]))
#The flagged value is parsed from the test's alternative-hypothesis text and appended to our outliers vector
test <- x[!x %in% outliers]
#Keep only the values that haven't been flagged as outliers
#Next, we'll run the grubbs test again without the outliers
grubbs.result <- grubbs.test(test)
#Then finally save the new p-values from the test
pv <- grubbs.result$p.value
}
return(data.frame(x=x, outliers = (x %in% outliers)))
}
Now let’s run our function.
identified_outliers <- grubbs.flag(Air_data$AH)
Now we can create a histogram showing these outliers, reusing the identified_outliers object we just created.
ggplot(identified_outliers, aes(x=x, color=outliers, fill=outliers))+
geom_histogram(binwidth=diff(range(Air_data$AH))/30)+
theme_bw()
Covariation is the tendency for two variables to change together, and detecting it can provide valuable insight into the trends of many datasets. We’ll start by comparing a categorical variable with a continuous one, using the diamonds dataset to illustrate. The first plot we’ll use is ggplot’s frequency polygon:
library(ggplot2)
ggplot(data = diamonds, mapping=aes(x=price)) +
geom_freqpoly(mapping=aes(color=cut), binwidth=500)
Though this does a decent job of showing the relative amounts of each cut, it’s hard to compare the distributions because the overall counts differ so much.
Let’s look at the counts of cut grade alone, without value:
ggplot(diamonds)+
geom_bar(mapping=aes(x=cut))
To make the comparison easier, we’ll swap the display on the y-axis. Instead of displaying count, we’ll display density: the same counts but standardized so the area under the curve is one.
ggplot(data=diamonds, mapping=aes(x=price,y=..density..))+
geom_freqpoly(mapping=aes(color=cut), binwidth=500)
It would appear that fair diamonds (the lowest cut quality) have the highest average price. Perhaps this is due to frequency polygons being a little hard to interpret.
An alternative is the boxplot. A boxplot is a type of visual shorthand for a distribution of values.
ggplot(data=diamonds, mapping=aes(x=cut, y=price))+
geom_boxplot()
So here, we see much less information about each distribution, but the boxplots are more compact and easier to compare with one another (clearer medians and spreads).
This supports the counterintuitive finding that better quality diamonds are cheaper on average.
Let’s look at some car data next. We’ll plot car class against their highway miles per gallon.
ggplot(data = mpg, mapping=aes(x=class, y=hwy))+
geom_boxplot()
It’s a bit messy, so let’s order the classes by ascending median highway mpg.
ggplot(data=mpg)+
geom_boxplot(mapping=aes(x=reorder(class, hwy, FUN=median), y=hwy))
If you have long variable names, you can switch the axis and flip it 90 degrees.
ggplot(data=mpg)+
geom_boxplot(mapping=aes(x=reorder(class,hwy, FUN=median), y=hwy))+
coord_flip()
To visualize the correlation between two continuous variables, we can use a scatter plot.
ggplot(data=diamonds)+
geom_point(mapping=aes(x=carat,y=price))
Scatterplots become less useful as the size of your dataset grows, because of something called overplotting. We can address this using the alpha aesthetic to make points transparent.
ggplot(data=diamonds)+
geom_point(mapping=aes(x=carat, y=price), alpha=5/100)
Now, we can see that the points sort of ‘shade’ locations with large numbers of points, so we can see concentrations more clearly than a flat scatter graph.
Data can often come to you in a form that requires exploration and experimentation to find the best way to draw conclusions, and then to display those conclusions. In this section, we’ll walk through a number of general techniques one might use when exploring data.
First, we’ll load required libraries and set up some data. RCurl allows us to load data from remote sites, and our data is being sourced from nytimes’ covid-19 college data.
library(RCurl)
library(dplyr)
site <- "https://raw.githubusercontent.com/nytimes/covid-19-data/master/colleges/colleges.csv"
College_Data <- read.csv(site)
Next, we’ll use the str function to show the structure of our data.
str(College_Data)
## 'data.frame': 1948 obs. of 9 variables:
## $ date : chr "2021-05-26" "2021-05-26" "2021-05-26" "2021-05-26" ...
## $ state : chr "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ county : chr "Madison" "Montgomery" "Limestone" "Lee" ...
## $ city : chr "Huntsville" "Montgomery" "Athens" "Auburn" ...
## $ ipeds_id : chr "100654" "100724" "100812" "100858" ...
## $ college : chr "Alabama A&M University" "Alabama State University" "Athens State University" "Auburn University" ...
## $ cases : int 41 2 45 2742 220 4 263 137 49 76 ...
## $ cases_2021: int NA NA 10 567 80 NA 49 53 10 35 ...
## $ notes : chr "" "" "" "" ...
What if we wanted to arrange our dataset alphabetically by college?
alphabetical <- College_Data %>%
arrange(College_Data$college)
The glimpse() function also gives us another way to preview our data.
glimpse(College_Data)
## Rows: 1,948
## Columns: 9
## $ date <chr> "2021-05-26", "2021-05-26", "2021-05-26", "2021-05-26", "20…
## $ state <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Ala…
## $ county <chr> "Madison", "Montgomery", "Limestone", "Lee", "Montgomery", …
## $ city <chr> "Huntsville", "Montgomery", "Athens", "Auburn", "Montgomery…
## $ ipeds_id <chr> "100654", "100724", "100812", "100858", "100830", "102429",…
## $ college <chr> "Alabama A&M University", "Alabama State University", "Athe…
## $ cases <int> 41, 2, 45, 2742, 220, 4, 263, 137, 49, 76, 67, 0, 229, 19, …
## $ cases_2021 <int> NA, NA, 10, 567, 80, NA, 49, 53, 10, 35, 5, NA, 10, NA, 19,…
## $ notes <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
We can subset our data with select()
College_Cases <- select(College_Data, college, cases)
We can also filter or subset using the filter() function.
Louisiana_Cases <- filter(College_Data, state == 'Louisiana')
Let’s just filter out for a small number of states.
South_Cases <- filter(College_Data, state == 'Louisiana' | state == 'Texas' | state == 'Arkansas' | state == 'Mississippi')
Next, let’s look at some time-series data, starting with loading the packages and data:
library(lubridate)
library(dplyr)
library(ggplot2)
library(gridExtra)
library(scales)
state_site <- 'https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv'
State_Data <- read.csv(state_site)
Let’s make a group_by object using the state column.
State_Cases <- group_by(State_Data, state)
class(State_Cases)
## [1] "grouped_df" "tbl_df" "tbl" "data.frame"
How many measurements were made by state? This gives us an idea of when states started reporting.
Days_since_first_report <- tally(State_Cases)
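Since we don’t otherwise inspect this object, here’s a quick, optional peek (output not shown); tally() stores its count in a column named n by default.
head(arrange(Days_since_first_report, desc(n))) #states with the most daily reports listed first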
Let’s next do some visualization of our data, but we’ll start with definitions.
Data - the data we want to visualize
Layer - a collection of geometric elements (geoms) and the statistical transformations applied to the data
Scales - map values in data space to aesthetics that carry meaning (color, size, shape, etc.)
Coordinate system - describes how data points are mapped onto the plane of the graphic
Faceting - breaks the data into subsets displayed as separate panels or groups
Theme - controls the finer points of the display, such as font size and background color
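To tie these components together, here’s a minimal sketch that builds a single plot piece by piece. It reuses the South_Cases subset created earlier; the particular geom, colors, and theme are only illustrative choices.
ggplot(data = South_Cases, aes(x = cases, y = cases_2021, color = state)) + #data plus aesthetic mappings
geom_point(na.rm = TRUE) + #layer: points, quietly dropping rows with missing cases_2021
scale_color_manual(values = c('forestgreen','blue','magenta','gold')) + #scale: map each state to a color
coord_cartesian() + #coordinate system (the default Cartesian mapping)
facet_wrap(~state) + #faceting: one panel per state
theme_minimal() #theme: background, fonts, and other display details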
Let’s look at a simple way to inspect a dataframe:
options(repr.plot.width=6, repr.plot.height=6)
class(College_Data)
## [1] "data.frame"
head(College_Data)
## date state county city ipeds_id
## 1 2021-05-26 Alabama Madison Huntsville 100654
## 2 2021-05-26 Alabama Montgomery Montgomery 100724
## 3 2021-05-26 Alabama Limestone Athens 100812
## 4 2021-05-26 Alabama Lee Auburn 100858
## 5 2021-05-26 Alabama Montgomery Montgomery 100830
## 6 2021-05-26 Alabama Walker Jasper 102429
## college cases cases_2021 notes
## 1 Alabama A&M University 41 NA
## 2 Alabama State University 2 NA
## 3 Athens State University 45 10
## 4 Auburn University 2742 567
## 5 Auburn University at Montgomery 220 80
## 6 Bevill State Community College 4 NA
summary(College_Data)
## date state county city
## Length:1948 Length:1948 Length:1948 Length:1948
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## ipeds_id college cases cases_2021
## Length:1948 Length:1948 Min. : 0.0 Min. : 0.0
## Class :character Class :character 1st Qu.: 32.0 1st Qu.: 23.0
## Mode :character Mode :character Median : 114.5 Median : 65.0
## Mean : 363.5 Mean : 168.1
## 3rd Qu.: 303.0 3rd Qu.: 159.0
## Max. :9914.0 Max. :3158.0
## NA's :337
## notes
## Length:1948
## Class :character
## Mode :character
##
##
##
##
Let’s look at a different dataset in the same way:
iris <- as.data.frame(iris)
class(iris)
## [1] "data.frame"
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
Let’s create a scatter plot of the College Data.
ggplot(data=College_Data, aes(x=cases, y=cases_2021))+
geom_point()+
theme_minimal()
Now let’s look at the iris dataset.
ggplot(data=iris, aes(x=Sepal.Width, y=Sepal.Length))+
geom_point()+
theme_minimal()
Let’s color-code our college data by state.
ggplot(data=College_Data, aes(x=cases, y=cases_2021, color=state))+
geom_point()+
theme_minimal()
And let’s do the same for the iris data
ggplot(data=iris, aes(x=Sepal.Width, y=Sepal.Length, color=Species))+
geom_point()+
theme_minimal()
We’ll next create a simple histogram using our Louisiana Case data
hist(Louisiana_Cases$cases, freq=NULL, density=NULL, breaks=10, xlab='Total Cases', ylab='Frequency', main='Total College Covid-19 Infections (Louisiana)')
And the same for the iris data:
hist(iris$Sepal.Width, freq=NULL, density=NULL, breaks=10, xlab='Sepal Width', ylab='Frequency', main='Iris Sepal Width')
Next let’s get fancy. We’ll set up the histogram to color-code based on the county/parish of Louisiana, stacking them atop one another in a fashion similar to a stacked bar graph.
histogram_college <- ggplot(data=Louisiana_Cases, aes(x=cases))
histogram_college + geom_histogram(binwidth=100, color='black', aes(fill=county))+
xlab('cases') + ylab('Frequency') + ggtitle('Histogram of Covid-19 Cases in Louisiana')
And here we do a similar thing to the iris data, just using species as our delimiting factor rather than county/parish.
histogram_iris <- ggplot(data=iris, aes(x=Sepal.Width))
histogram_iris + geom_histogram(binwidth=0.2, color='black', aes(fill=Species)) +
xlab('Sepal Width') + ylab('Frequency') + ggtitle('Histogram of Iris Sepal Widths by Species')
Maybe a density plot makes more sense for our college data
ggplot(South_Cases)+
geom_density(aes(x=cases, fill=state), alpha=0.35)
Again with the iris data:
ggplot(iris) +
geom_density(aes(x=Sepal.Width, fill=Species), alpha=0.35)
Let’s instead try setting up some violin plots for sepal length by species.
ggplot(data=iris, aes(x=Species, y=Sepal.Length, color=Species)) +
geom_violin() +
theme_dark() +
theme(legend.position='none')
Let’s do the same with some of our covid data:
ggplot(data=South_Cases, aes(x=state, y=cases, color=state))+
geom_violin()+
theme_dark()+
theme(legend.position='none')
Now let’s take a look at residual plots. These display the residuals on the vertical axis and the fitted values on the horizontal axis. If the points in a residual plot are dispersed randomly around the horizontal axis, a linear regression is appropriate; if they are not, a nonlinear model is a better fit.
We’ll start demonstrating this using the iris data
ggplot(lm(Sepal.Length~Sepal.Width, data=iris))+
geom_point(aes(x=.fitted, y=.resid))
As you can see, the residuals for the iris data appear randomly scattered, so this relationship can reasonably be modeled with a linear model.
Let’s look at southern states’ covid cases instead:
ggplot(lm(cases~cases_2021, data=South_Cases))+
geom_point(aes(x=.fitted, y=.resid))
In this case, there’s a very clear bias to the data, indicating that a nonlinear model should be used with it.
Now let’s look at some correlations. We’ll load a new dataset about obesity and insurance rates, as well as our required packages.
library(tidyr)
library(plyr)
library(dplyr)
obesity <- read.csv('~/School/Fall 2022/Bioinf/R Files/Exploratory Data Analysis/Obesity_insurance.csv')
# Let's also check its class and structure
class(obesity)
## [1] "data.frame"
str(obesity)
## 'data.frame': 1338 obs. of 7 variables:
## $ age : int 19 18 28 33 32 31 46 37 37 60 ...
## $ sex : chr "female" "male" "male" "male" ...
## $ bmi : num 27.9 33.8 33 22.7 28.9 ...
## $ children: int 0 1 3 0 0 0 1 3 2 0 ...
## $ smoker : chr "yes" "no" "no" "no" ...
## $ region : chr "southwest" "southeast" "southeast" "northwest" ...
## $ charges : num 16885 1726 4449 21984 3867 ...
Let’s also get a summary of distribution of our variables.
summary(obesity)
## age sex bmi children
## Min. :18.00 Length:1338 Min. :15.96 Min. :0.000
## 1st Qu.:27.00 Class :character 1st Qu.:26.30 1st Qu.:0.000
## Median :39.00 Mode :character Median :30.40 Median :1.000
## Mean :39.21 Mean :30.66 Mean :1.095
## 3rd Qu.:51.00 3rd Qu.:34.69 3rd Qu.:2.000
## Max. :64.00 Max. :53.13 Max. :5.000
## smoker region charges
## Length:1338 Length:1338 Min. : 1122
## Class :character Class :character 1st Qu.: 4740
## Mode :character Mode :character Median : 9382
## Mean :13270
## 3rd Qu.:16640
## Max. :63770
Now let’s look at the distribution for insurance charges.
hist(obesity$charges)
We can also look at it as a boxplot.
boxplot(obesity$charges)
Now let’s actually start looking for correlations in our data. The cor() command is used to determine correlations between two vectors, all of the columns in a data frame, or two data frames. The cov() command, however, examines covariance. The cor.test() command carries out a test as to the significance of the correlation.
cor(obesity$charges, obesity$bmi)
## [1] 0.198341
By default, cor() computes a Pearson correlation; you can request Kendall’s tau (or Spearman’s rho) by specifying the method argument.
cor(obesity$charges, obesity$bmi, method='kendall')
## [1] 0.08252397
Correlation coefficients range from -1 to 1, so a value close to 0 indicates very weak correlation.
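The cov() and cor.test() commands mentioned above work on the same pair of vectors; a brief sketch (output not shown):
cov(obesity$charges, obesity$bmi) #covariance, on the original scales of the two variables
cor.test(obesity$charges, obesity$bmi) #Pearson correlation with a p-value and 95% confidence interval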
Let’s next look at the Tietjen-Moore test, which detects a specified number of outliers in a univariate dataset that is assumed to be approximately normal.
TietjenMoore <- function(dataSeries, k)
{
n = length(dataSeries)
#Compute the absolute residuals
r = abs(dataSeries - mean(dataSeries))
#Sort data according to size of residual
df = data.frame(dataSeries, r)
dfs = df[order(df$r),]
#Create a subset of the data without the largest values
klarge = c((n-k+1):n)
subdataSeries = dfs$dataSeries[-klarge]
#Compute the sum of squares
ksub = (subdataSeries - mean(subdataSeries))**2
all = (df$dataSeries - mean(df$dataSeries))**2
#Compute the test statistic
sum(ksub)/sum(all)
}
This function computes the absolute residuals, sorts the data by residual size, removes the k most extreme values, and returns the ratio of the two sums of squares as the test statistic.
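Written out, the value returned is the standard two-sided Tietjen-Moore statistic
$$E_k = \frac{\sum_{i=1}^{n-k} (y_i - \bar{y}_k)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},$$
where the $y_i$ are ordered by absolute distance from the sample mean, $\bar{y}_k$ is the mean after the k most extreme points are removed, and $\bar{y}$ is the mean of the full sample. Values near 1 suggest no outliers, while values near 0 suggest the removed points are outliers. Next, we’ll wrap this statistic in a function that estimates its critical value by simulation.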
FindOutliersTietjenMooreTest <- function(dataSeries, k, alpha=0.5){
ek <- TietjenMoore(dataSeries, k)
#Compute critical values based on simulation
test = c(1:10000)
for (i in 1:10000){
dataSeriesdataSeries=rnorm(length(dataSeries))
test[i]=TietjenMoore(dataSeriesdataSeries, k)}
Talpha=quantile(test, alpha)
list(T=ek, Talpha=Talpha)
}
This second function estimates the critical value by simulation. Now let’s demonstrate these functions on the obesity dataset.
The critical region for the Tietjen-Moore test is determined by simulation. The simulation generates standard normal random samples of size n and computes the Tietjen-Moore test statistic for each; typically, 10,000 random samples are used. The value of the statistic obtained from the data is then compared to this reference distribution. The test statistic lies between zero and one: if there are no outliers in the data, it is close to 1, and as more outliers appear it drifts toward 0. Thus, the test is always a lower, one-tailed test regardless of which test statistic (Lk or Ek) is used.
First, we’ll look at charges. Note that the second argument to FindOutliersTietjenMooreTest is the number of suspected outliers.
boxplot(obesity$charges)
FindOutliersTietjenMooreTest(obesity$charges, 50)
## $T
## [1] 0.6751565
##
## $Talpha
## 50%
## 0.7731745
Let’s check out bmi next.
boxplot(obesity$bmi)
FindOutliersTietjenMooreTest(obesity$bmi, 7)
## $T
## [1] 0.9475628
##
## $Talpha
## 50%
## 0.9505247
As per usual, we’ll begin by loading our necessary packages. Note that a new package to this section is tigerstats.
library(ggplot2)
library(tigerstats)
We will work with R’s normal distribution functions, starting with dnorm, the density function of the normal distribution. Using the density, it is possible to gauge how likely particular values are.
For example, you may wonder, “How likely is it that a person has a BMI of exactly X?” In that case, you would retrieve the density of the BMI distribution at value X.
Assuming the BMI distribution is approximately normal, we can model it with the sample mean and standard deviation and find the corresponding density. Below, we input our parameters and generate our densities, then plot a scatter plot of BMI against density.
bmi.mean <- mean(obesity$bmi)
bmi.sd <- sd(obesity$bmi)
bmi.dist <- dnorm(obesity$bmi, mean=bmi.mean, sd=bmi.sd)
bmi.df <- data.frame('bmi'=obesity$bmi, 'Density'=bmi.dist)
ggplot(bmi.df, aes(x=bmi, y=Density))+
geom_point()
This gives us the density at each observed BMI value. Next, let’s use the pnorm function, the cumulative distribution function, which instead gives the probability of a BMI at or below each value.
bmi.dist <- pnorm(obesity$bmi, mean=bmi.mean, sd=bmi.sd)
bmi.df <- data.frame('bmi'=obesity$bmi, 'Density'=bmi.dist)
ggplot(bmi.df, aes(x=bmi, y=Density))+
geom_point()
What if we want to find the probability of the bmi being greater than 40 in our distribution?
pp_greater <- function(x){
paste(round(100*pnorm(x, mean=30.66339, sd=6.09818, lower.tail=FALSE), 2), '%')
}
pp_greater(40)
## [1] "6.29 %"
To graphically represent everything included with this function (BMI over 40), we can use pnormGC.
pnormGC(40, region='above', mean=30.66339, sd=9.0818, graph=TRUE)
## [1] 0.1519615
What about the probability that a bmi is less than 40 in our distribution?
pp_less <- function(x){
paste(round(100*(1-pnorm(x, mean=30.66339, sd=6.09818, lower.tail=FALSE)), 2), '%')
}
pp_less(40)
## [1] "93.71 %"
pnormGC(40, region='below', mean=30.66339, sd=9.0818, graph=TRUE)
## [1] 0.8480385
What if we want to find the area in between 20 and 40?
pnormGC(c(20,40), region='between', mean=30.66339, sd=6.09818, graph=TRUE)
## [1] 0.8969428
What if we want to know the quantiles? For that we use the qnorm function. We must assume a normal distribution for this.
What bmi represents the lowest 1% of the population?
qnorm(0.01, mean=30.66339, sd=6.09818, lower.tail=TRUE)
## [1] 16.4769
What if you want a random sampling of values within your distribution?
subset <- rnorm(50, mean=30.66339, sd=6.09818)
hist(subset)
Let’s do it again with more samples. Remember: the larger the sample, the more closely the histogram approximates the underlying normal distribution.
subset <- rnorm(5000, mean=30.66339, sd=6.09818)
hist(subset)
Shapiro-Wilk Tests
So we know how to generate a normal distribution. How do we tell if our samples come from a normal distribution in the first place?
shapiro.test(obesity$charges[1:5])
##
## Shapiro-Wilk normality test
##
## data: obesity$charges[1:5]
## W = 0.84164, p-value = 0.1695
As you can see, with such a small sample we fail to reject the null hypothesis that the sample comes from a normal distribution (p = 0.17); the test simply has very little power at n = 5. We can increase the power - the test’s capacity to detect departures from normality - by increasing the sample size.
shapiro.test(obesity$charges[1:1000])
##
## Shapiro-Wilk normality test
##
## data: obesity$charges[1:1000]
## W = 0.8119, p-value < 2.2e-16
Next let’s look at age.
shapiro.test(obesity$age[1:1000])
##
## Shapiro-Wilk normality test
##
## data: obesity$age[1:1000]
## W = 0.94406, p-value < 2.2e-16
And finally, BMI.
shapiro.test(obesity$bmi[1:1000])
##
## Shapiro-Wilk normality test
##
## data: obesity$bmi[1:1000]
## W = 0.99471, p-value = 0.001426
Let’s start this section by loading our packages and data:
library(readr)
library(readxl)
library(lubridate)
library(hms)
Air_data <- read_excel('~/School/Fall 2022/Bioinf/R Files/Exploratory Data Analysis/AirQualityUCI.xlsx')
What is contained within our data? Let’s get a view of its structure:
str(Air_data)
## tibble [9,357 × 15] (S3: tbl_df/tbl/data.frame)
## $ Date : POSIXct[1:9357], format: "2004-03-10" "2004-03-10" ...
## $ Time : POSIXct[1:9357], format: "1899-12-31 18:00:00" "1899-12-31 19:00:00" ...
## $ CO(GT) : num [1:9357] 2.6 2 2.2 2.2 1.6 1.2 1.2 1 0.9 0.6 ...
## $ PT08.S1(CO) : num [1:9357] 1360 1292 1402 1376 1272 ...
## $ NMHC(GT) : num [1:9357] 150 112 88 80 51 38 31 31 24 19 ...
## $ C6H6(GT) : num [1:9357] 11.88 9.4 9 9.23 6.52 ...
## $ PT08.S2(NMHC): num [1:9357] 1046 955 939 948 836 ...
## $ NOx(GT) : num [1:9357] 166 103 131 172 131 89 62 62 45 -200 ...
## $ PT08.S3(NOx) : num [1:9357] 1056 1174 1140 1092 1205 ...
## $ NO2(GT) : num [1:9357] 113 92 114 122 116 96 77 76 60 -200 ...
## $ PT08.S4(NO2) : num [1:9357] 1692 1559 1554 1584 1490 ...
## $ PT08.S5(O3) : num [1:9357] 1268 972 1074 1203 1110 ...
## $ T : num [1:9357] 13.6 13.3 11.9 11 11.2 ...
## $ RH : num [1:9357] 48.9 47.7 54 60 59.6 ...
## $ AH : num [1:9357] 0.758 0.725 0.75 0.787 0.789 ...
Let’s tweak our time column by removing the odd redundant dates.
Air_data$Time <- as_hms(Air_data$Time)
glimpse(Air_data)
## Rows: 9,357
## Columns: 15
## $ Date <dttm> 2004-03-10, 2004-03-10, 2004-03-10, 2004-03-10, 2004-…
## $ Time <time> 18:00:00, 19:00:00, 20:00:00, 21:00:00, 22:00:00, 23:…
## $ `CO(GT)` <dbl> 2.6, 2.0, 2.2, 2.2, 1.6, 1.2, 1.2, 1.0, 0.9, 0.6, -200…
## $ `PT08.S1(CO)` <dbl> 1360.00, 1292.25, 1402.00, 1375.50, 1272.25, 1197.00, …
## $ `NMHC(GT)` <dbl> 150, 112, 88, 80, 51, 38, 31, 31, 24, 19, 14, 8, 16, 2…
## $ `C6H6(GT)` <dbl> 11.881723, 9.397165, 8.997817, 9.228796, 6.518224, 4.7…
## $ `PT08.S2(NMHC)` <dbl> 1045.50, 954.75, 939.25, 948.25, 835.50, 750.25, 689.5…
## $ `NOx(GT)` <dbl> 166, 103, 131, 172, 131, 89, 62, 62, 45, -200, 21, 16,…
## $ `PT08.S3(NOx)` <dbl> 1056.25, 1173.75, 1140.00, 1092.00, 1205.00, 1336.50, …
## $ `NO2(GT)` <dbl> 113, 92, 114, 122, 116, 96, 77, 76, 60, -200, 34, 28, …
## $ `PT08.S4(NO2)` <dbl> 1692.00, 1558.75, 1554.50, 1583.75, 1490.00, 1393.00, …
## $ `PT08.S5(O3)` <dbl> 1267.50, 972.25, 1074.00, 1203.25, 1110.00, 949.25, 73…
## $ T <dbl> 13.600, 13.300, 11.900, 11.000, 11.150, 11.175, 11.325…
## $ RH <dbl> 48.875, 47.700, 53.975, 60.000, 59.575, 59.175, 56.775…
## $ AH <dbl> 0.7577538, 0.7254874, 0.7502391, 0.7867125, 0.7887942,…
Now let’s do a simple plot of our humidity data.
plot(Air_data$AH, Air_data$RH, main='Humidity Analysis', xlab='Absolute Humidity', ylab='Relative Humidity')
Notice that we have strong outliers in our data (this dataset codes missing values as -200). Let's run a t-test comparing the two humidity measures.
t.test(Air_data$RH, Air_data$AH)
##
## Welch Two Sample t-test
##
## data: Air_data$RH and Air_data$AH
## t = 69.62, df = 17471, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 45.01707 47.62536
## sample estimates:
## mean of x mean of y
## 39.483611 -6.837604
As you can see, the averages for our two humidity measures are significantly different; those -200 placeholder values are also wreaking havoc on the plot and drag the mean absolute humidity below zero.
Text mining is the process of looking through sets of strings, often from social media or publications, and analyzing them on the basis of various parameters. Most frequently, text mining is used on social media to track sentiment toward companies, social issues, and political candidates or parties. It can also be useful for tracking the frequency of certain words in published works, as well as for deducing what proportion of such publications focus on certain subjects.
Let's start with the unnest_tokens function. We'll begin by looking at an Emily Dickinson passage.
text <- c("Because I could not stop from Death -",
"He kindly stopped for me -",
"The Carriage held but just Ourselves -",
"and Immortality")
text
## [1] "Because I could not stop from Death -"
## [2] "He kindly stopped for me -"
## [3] "The Carriage held but just Ourselves -"
## [4] "and Immortality"
This is an example of a typical character vector that we might want to analyze. We’ll use something called tidytext to make it more easily manipulable. To turn this into a tidytext dataset, we’ll need to put it into a dataframe.
library(dplyr)
text_df <- tibble(line=1:4, text=text)
text_df
## # A tibble: 4 × 2
## line text
## <int> <chr>
## 1 1 Because I could not stop from Death -
## 2 2 He kindly stopped for me -
## 3 3 The Carriage held but just Ourselves -
## 4 4 and Immortality
Reminder: a tibble is a modern class of dataframe within R. It’s available in the dplyr and tibble packages, has a convenient print method, will not convert strings to factors, and does not use row names. Tibbles are great for use with tidy tools.
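As a quick sketch with toy data (the columns here are made up purely for illustration), you can see the difference in how the two classes store and print the same input:
library(tibble)
df_base <- data.frame(id = 1:3, label = c('a','b','c')) #a plain data.frame; older R versions would have coerced label to a factor
df_tbl <- tibble(id = 1:3, label = c('a','b','c')) #a tibble; label stays character and printing shows each column's type
df_tbl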
Now, we’ll use the unnest_tokens function.
First, we have to establish the name of the output column that will be created as the text is unnested into it. We'll split our text so that each word gets its own row in a word column, associated with its line number. This will be done using the tidytext package.
library(tidytext)
text_df %>%
unnest_tokens(word, text)
## # A tibble: 20 × 2
## line word
## <int> <chr>
## 1 1 because
## 2 1 i
## 3 1 could
## 4 1 not
## 5 1 stop
## 6 1 from
## 7 1 death
## 8 2 he
## 9 2 kindly
## 10 2 stopped
## 11 2 for
## 12 2 me
## 13 3 the
## 14 3 carriage
## 15 3 held
## 16 3 but
## 17 3 just
## 18 3 ourselves
## 19 4 and
## 20 4 immortality
Let’s use the janeaustenr package to analyze some Jane Austen texts. There are six books in this package.
The code below will group the text by book, then mutate the table so that linenumber equals the row number and the chapter number is tracked in its own column.
library(janeaustenr)
library(stringr)
original_books <- austen_books() %>%
group_by(book) %>%
mutate(linenumber=row_number(),
chapter=cumsum(str_detect(text, regex('^chapter [\\divxlc]',
ignore_case=TRUE)))) %>%
ungroup()
original_books
## # A tibble: 73,422 × 4
## text book linenumber chapter
## <chr> <fct> <int> <int>
## 1 "SENSE AND SENSIBILITY" Sense & Sensibility 1 0
## 2 "" Sense & Sensibility 2 0
## 3 "by Jane Austen" Sense & Sensibility 3 0
## 4 "" Sense & Sensibility 4 0
## 5 "(1811)" Sense & Sensibility 5 0
## 6 "" Sense & Sensibility 6 0
## 7 "" Sense & Sensibility 7 0
## 8 "" Sense & Sensibility 8 0
## 9 "" Sense & Sensibility 9 0
## 10 "CHAPTER 1" Sense & Sensibility 10 1
## # … with 73,412 more rows
To work with this as a tidy dataset, we need to restructure it in the one-token-per-row format, which as we saw earlier is done with the unnest_tokens() function.
tidybooks <- original_books %>%
unnest_tokens(word, text)
tidybooks
## # A tibble: 725,055 × 4
## book linenumber chapter word
## <fct> <int> <int> <chr>
## 1 Sense & Sensibility 1 0 sense
## 2 Sense & Sensibility 1 0 and
## 3 Sense & Sensibility 1 0 sensibility
## 4 Sense & Sensibility 3 0 by
## 5 Sense & Sensibility 3 0 jane
## 6 Sense & Sensibility 3 0 austen
## 7 Sense & Sensibility 5 0 1811
## 8 Sense & Sensibility 10 1 chapter
## 9 Sense & Sensibility 10 1 1
## 10 Sense & Sensibility 13 1 the
## # … with 725,045 more rows
This function uses the tokenizers package to separate each line of text in the original dataframe into tokens. The default tokenizing is for words, but other options including characters, n-grams, sentences, lines, or paragraphs can be used.
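For example, assuming the same text_df we built earlier is still available, we could tokenize by sentence or by bigram instead of by individual word; this sketch only changes the token argument.
text_df %>%
  unnest_tokens(sentence, text, token='sentences') #one row per sentence
text_df %>%
  unnest_tokens(bigram, text, token='ngrams', n=2) #one row per overlapping two-word combination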
Now that the data is in a one-word-per-row format, we can manipulate it with tools like dplyr. Often in text analysis, we will want to remove stop words.
Stop words are words that are not useful for an analysis, such as ‘the’, ‘of’, ‘and’, and so on. We can remove stop words (kept in the tidytext dataset ‘stop_words’) with an anti_join().
data(stop_words)
tidier_books <- tidybooks %>%
anti_join(stop_words)
## Joining, by = "word"
The stop_words dataset in the tidytext package contains stop words from three lexicons. We can use them all together as we have here, or use filter() to keep only one set of stop words if that's more appropriate for your analysis. Naturally, you can also make your own dictionary dataset to filter out specific words, as sketched below.
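For instance, here is a sketch of restricting the join to a single lexicon, or supplying a small custom dictionary (the words chosen here are arbitrary examples):
snowball_stops <- stop_words %>%
  filter(lexicon == 'snowball') #keep only the snowball stop word list
my_stops <- tibble(word = c('miss', 'dear'), lexicon = 'custom') #hypothetical custom dictionary
tidybooks %>%
  anti_join(snowball_stops, by = 'word') %>%
  anti_join(my_stops, by = 'word') %>%
  count(word, sort = TRUE)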
Let’s compare word counts.
tidybooks %>%
count(word, sort=TRUE)
## # A tibble: 14,520 × 2
## word n
## <chr> <int>
## 1 the 26351
## 2 to 24044
## 3 and 22515
## 4 of 21178
## 5 a 13408
## 6 her 13055
## 7 i 12006
## 8 in 11217
## 9 was 11204
## 10 it 10234
## # … with 14,510 more rows
tidier_books %>%
count(word, sort=TRUE)
## # A tibble: 13,914 × 2
## word n
## <chr> <int>
## 1 miss 1855
## 2 time 1337
## 3 fanny 862
## 4 dear 822
## 5 lady 817
## 6 sir 806
## 7 day 797
## 8 emma 787
## 9 sister 727
## 10 house 699
## # … with 13,904 more rows
Because we’ve been using tidy tools, our word counts are stored in a tidy data frame, allowing us to pipe this directly into ggplot2. For example, we can create a visualization of the most common words.
library(ggplot2)
tidier_books %>%
count(word, sort=TRUE) %>%
filter(n > 600) %>% #Keep only words that appear more than 600 times
mutate(word=reorder(word, n)) %>% #Reorder by word quantity
ggplot(aes(n,word)) +
geom_col() +
labs(y=NULL, x='word count')
Let’s look at the gutenbergr package next.
This package provides access to public domain works from Project Gutenberg (www.gutenberg.org). It includes both tools for downloading books and a complete dataset of Project Gutenberg metadata that can be used to find works of interest. We will mostly use the function gutenberg_download().
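If you don't already know a book's ID, the package's metadata can be searched. As a sketch, gutenberg_works() filters the bundled metadata to works with available text; the exact author string below is an assumption about how the name is stored in that metadata.
library(gutenbergr)
gutenberg_works(author == 'Darwin, Charles') %>% #filter the metadata by author
  select(gutenberg_id, title)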
We'll first look at word frequencies using some biology texts from Darwin (each title is followed by its Project Gutenberg ID):
The Voyage of the Beagle - 944
On the Origin of Species by Means of Natural Selection - 1228
The Expression of the Emotions in Man and Animals - 1227
The Descent of Man, and Selection in Relation to Sex - 2300
library(gutenbergr)
darwin <- gutenberg_download(c(944, 1227, 1228, 2300), mirror = "https://mirror2.sandyriver.net/pub/gutenberg")
Let’s break these into tokens
tidy_darwin <- darwin %>%
unnest_tokens(word, text) %>%
anti_join(stop_words)
## Joining, by = "word"
Let’s see what the most common Darwin words are
tidy_darwin %>%
count(word, sort=TRUE)
## # A tibble: 23,630 × 2
## word n
## <chr> <int>
## 1 species 2998
## 2 male 1672
## 3 males 1337
## 4 animals 1310
## 5 birds 1292
## 6 female 1197
## 7 sexes 1095
## 8 females 1038
## 9 selection 1038
## 10 sexual 801
## # … with 23,620 more rows
Now let's get some works from Thomas Hunt Morgan, who is credited with demonstrating the role chromosomes play in heredity.
Regeneration - 57198
The Genetic and Operative Evidence Relating to Secondary Sexual Characteristics - 57460
Evolution and Adaptation - 63540
morgan <- gutenberg_download(c(57198, 57460, 63540), mirror = 'https://mirror2.sandyriver.net/pub/gutenberg')
Next, we’ll tokenize the words
tidy_morgan <- morgan %>%
unnest_tokens(word, text) %>%
anti_join(stop_words)
## Joining, by = "word"
What are the most common words therein?
tidy_morgan %>%
count(word, sort=TRUE)
## # A tibble: 13,855 × 2
## word n
## <chr> <int>
## 1 species 869
## 2 regeneration 814
## 3 piece 702
## 4 cut 669
## 5 male 668
## 6 forms 631
## 7 selection 604
## 8 cells 576
## 9 found 552
## 10 development 546
## # … with 13,845 more rows
Lastly, we’ll look at Thomas Henry Huxley.
Evidence as to Man's Place in Nature - 2931
On the Reception of the Origin of Species - 2089
Evolution and Ethics, and Other Essays - 2940
Science and Culture, and Other Essays - 52344
huxley <- gutenberg_download(c(2931, 2089, 2940, 52344), mirror = 'https://mirror2.sandyriver.net/pub/gutenberg')
tidy_huxley <- huxley %>%
unnest_tokens(word, text) %>%
anti_join(stop_words)
## Joining, by = "word"
And let’s tally words.
tidy_huxley %>%
count(word, sort=TRUE)
## # A tibble: 16,090 × 2
## word n
## <chr> <int>
## 1 species 339
## 2 nature 331
## 3 time 287
## 4 life 286
## 5 existence 255
## 6 knowledge 238
## 7 animals 227
## 8 natural 223
## 9 animal 216
## 10 science 207
## # … with 16,080 more rows
Now, let’s calculate the frequency for each word for the works of Darwin, Morgan, and Huxley by binding the frames together.
library(tidyr)
frequency <- bind_rows(mutate(tidy_morgan, author = 'Thomas Hunt Morgan'),
mutate(tidy_darwin, author = 'Charles Darwin'),
mutate(tidy_huxley, author = 'Thomas Henry Huxley')) %>%
mutate(word = str_extract(word, '[a-z]+')) %>% #keep only the lowercase-letter portion of each token, dropping digits and punctuation
count(author, word) %>%
group_by(author) %>%
mutate(proportion = n/sum(n)) %>%
select(-n) %>%
pivot_wider(names_from = author, values_from = proportion) %>%
pivot_longer('Thomas Hunt Morgan':'Charles Darwin', names_to = 'author', values_to = 'proportion')
frequency
## # A tibble: 95,082 × 3
## word author proportion
## <chr> <chr> <dbl>
## 1 a Thomas Hunt Morgan 0.00206
## 2 a Thomas Henry Huxley 0.0000856
## 3 a Charles Darwin 0.000141
## 4 ab Thomas Hunt Morgan 0.000165
## 5 ab Thomas Henry Huxley 0.0000978
## 6 ab Charles Darwin 0.00000642
## 7 abaiss Thomas Hunt Morgan NA
## 8 abaiss Thomas Henry Huxley NA
## 9 abaiss Charles Darwin 0.00000642
## 10 abandon Thomas Hunt Morgan 0.00000752
## # … with 95,072 more rows
Now, we need to reshape the table so that each author has its own column.
frequency2 <- pivot_wider(frequency, names_from='author', values_from='proportion')
frequency2
## # A tibble: 31,694 × 4
## word `Thomas Hunt Morgan` `Thomas Henry Huxley` `Charles Darwin`
## <chr> <dbl> <dbl> <dbl>
## 1 a 0.00206 0.0000856 0.000141
## 2 ab 0.000165 0.0000978 0.00000642
## 3 abaiss NA NA 0.00000642
## 4 abandon 0.00000752 0.0000122 0.00000321
## 5 abandoned 0.0000150 0.0000245 0.00000321
## 6 abashed NA NA 0.00000321
## 7 abatement NA 0.0000245 0.00000321
## 8 abbot NA 0.0000245 0.00000321
## 9 abbott NA NA 0.00000642
## 10 abbreviated NA NA 0.0000128
## # … with 31,684 more rows
Now we’ll plot the data.
library(scales)
ggplot(frequency2, aes(x=`Charles Darwin`, y=`Thomas Hunt Morgan`, color=abs(`Charles Darwin`-`Thomas Hunt Morgan`))) +
geom_abline(color='gray40', lty=2) +
geom_jitter(alpha=0.1, size=2.5, width=0.3, height=0.3) +
geom_text(aes(label=word), check_overlap=TRUE, vjust=1.5) +
scale_x_log10(labels=percent_format()) +
scale_y_log10(labels=percent_format()) +
scale_color_gradient(limits=c(0,0.001),
low='indianred', high='steelblue') +
theme(legend.position = 'none') +
labs(y='Thomas Hunt Morgan', x='Charles Darwin')
ggplot(frequency2, aes(x = `Thomas Hunt Morgan`, y = `Thomas Henry Huxley`, color = abs(`Thomas Hunt Morgan` - `Thomas Henry Huxley`))) +
geom_abline(color = 'grey40', lty = 2)+
geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
scale_x_log10(labels = percent_format()) +
scale_y_log10(labels = percent_format()) +
scale_color_gradient(limits = c(0,0.001),
low = 'darkslategray4', high = 'gray75') +
theme(legend.position='none') +
labs(y = 'Thomas Henry Huxley', x = 'Thomas Hunt Morgan')
A variety of methods and dictionaries exist for evaluating the opinion or emotion in text.
Some examples:
AFINN: assigns a score between -5 and 5, with negative numbers indicating negative sentiment, and the number itself describing magnitude of the sentiment
bing: categorizes words in a binary fashion into positive or negative
nrc: categorizes words into positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust
The function get_sentiments() allows us to get the specific sentiments lexicon with the measures for each one. Let’s set up objects for our dictionaries/methods and take a look at them.
library(tidytext)
library(textdata)
afinn <- get_sentiments('afinn')
afinn
## # A tibble: 2,477 × 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # … with 2,467 more rows
bing <- get_sentiments('bing')
bing
## # A tibble: 6,786 × 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # … with 6,776 more rows
nrc <- get_sentiments('nrc')
nrc
## # A tibble: 13,872 × 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # … with 13,862 more rows
These lexicons were created either via crowdsourcing platforms such as Amazon Mechanical Turk or through the direct labor of their authors, and were then validated by crowdsourcing.
Let’s look at the words with a joy score from some of Darwin’s works using nrc. We’ll start by setting up our wordbank from Darwin’s books.
library(gutenbergr)
library(dplyr)
library(stringr)
darwin <- gutenberg_download(c(944, 1227, 1228, 2300), mirror = 'https://mirror2.sandyriver.net/pub/gutenberg')
tidy_books <- darwin %>%
group_by(gutenberg_id) %>%
mutate(linenumber=row_number(), chapter=cumsum(str_detect(text, regex('^chapter [\\divxlc]', ignore_case=TRUE)))) %>%
ungroup() %>%
unnest_tokens(word, text)
tidy_books
## # A tibble: 786,575 × 4
## gutenberg_id linenumber chapter word
## <int> <int> <int> <chr>
## 1 944 1 0 the
## 2 944 1 0 voyage
## 3 944 1 0 of
## 4 944 1 0 the
## 5 944 1 0 beagle
## 6 944 1 0 by
## 7 944 2 0 charles
## 8 944 2 0 darwin
## 9 944 8 0 about
## 10 944 8 0 the
## # … with 786,565 more rows
Let’s add the book name instead of just using the gutenberg ID.
colnames(tidy_books)[1] <- 'book' #first just changing the value name of this column to book
tidy_books$book[tidy_books$book == 944] <- "The Voyage of the Beagle" #set book value of 944 to equal our string
tidy_books$book[tidy_books$book == 1227] <- "The Expression of the Emotions in Man and Animals"
tidy_books$book[tidy_books$book == 1228] <- "On the Origin of Species By Means of Natural Selection"
tidy_books$book[tidy_books$book == 2300] <- "The Descent of Man, and Selection in Relation to Sex"
Now that we have a tidy format with one word per row, we are ready for sentiment analysis.
nrc_joy <- get_sentiments('nrc') %>% #filter out only joy terms from nrc and store in a new object
filter(sentiment == 'joy')
tidy_books %>%
filter(book == 'The Voyage of the Beagle') %>% #only use The Voyage of the Beagle
inner_join(nrc_joy) %>% #only pull out words that match the joy list
count(word, sort = TRUE) #count those words
## # A tibble: 277 × 2
## word n
## <chr> <int>
## 1 found 301
## 2 good 161
## 3 remarkable 114
## 4 green 95
## 5 kind 92
## 6 tree 86
## 7 present 85
## 8 food 78
## 9 beautiful 61
## 10 elevation 60
## # … with 267 more rows
We can also examine how sentiment changes throughout a work. The same approach can be applied over time, for example to track sentiment about something like COVID.
library(tidyr)
Charles_Darwin_sentiment <- tidy_books %>%
inner_join(get_sentiments('bing')) %>%
count(book, index=linenumber %/% 80, sentiment) %>% #for each book, bin every 80 lines into a single index. Note the use of %/%, which is integer division
pivot_wider(names_from=sentiment, values_from=n, values_fill=0) %>%
mutate(sentiment=positive-negative)
Now let’s plot it to see how sentiment changes throughout the book.
library(ggplot2)
ggplot(Charles_Darwin_sentiment, aes(index, sentiment, fill=book)) +
geom_col(show.legend=FALSE) +
facet_wrap(~book, ncol=2, scales='free_x')
Note of course that this cannot accurately represent context, sarcasm, and such. For example, “not bad” may be recorded as a negative sentiment despite being a positive sentiment in typical speech. This analysis is very preliminary and rudimentary.
Let's now compare the three sentiment dictionaries. Each one serves a different purpose, so it's important to select the one that gives you the most usable information for your analysis. Below, we'll go through the steps of setting up a dataset for each lexicon.
voyage <- tidy_books %>%
filter(book == 'The Voyage of the Beagle')
afinn_voyage <- voyage %>%
inner_join(get_sentiments('afinn')) %>%
group_by(index=linenumber %/% 80) %>% #Again here, we use %/% to say "index every 80 linenumbers into groups"
summarise(sentiment=sum(value)) %>%
mutate(method='AFINN')
bing_nrc <- bind_rows(voyage %>%
inner_join(get_sentiments('bing')) %>%
mutate(method='Bing et al.'),
voyage %>%
inner_join(get_sentiments('nrc') %>%
filter(sentiment %in% c('positive', 'negative'))) %>% #Recall that nrc has many emotion tags; here, we're filtering for only positive/negative for use with bing
mutate(method='nrc')) %>%
count(method, index=linenumber %/% 80, sentiment) %>%
pivot_wider(names_from=sentiment, values_from=n, values_fill=0) %>% #Use names from sentiment for table, use n (our count) to get values, and if there's no value, input 0
mutate(sentiment=positive-negative)
We can now estimate the net sentiment (positive - negative) in each chunk of the novel text for each lexicon. Let’s bind it all together and visualize with ggplot.
bind_rows(afinn_voyage, bing_nrc) %>%
ggplot(aes(index, sentiment, fill=method)) +
geom_col(show.legend=TRUE) +
facet_wrap(~method, ncol=1, scales='free_y')
Let’s look at the counts based on each dictionary
get_sentiments('nrc') %>%
filter(sentiment %in% c('positive', 'negative')) %>%
count(sentiment)
## # A tibble: 2 × 2
## sentiment n
## <chr> <int>
## 1 negative 3316
## 2 positive 2308
get_sentiments('bing') %>%
count(sentiment)
## # A tibble: 2 × 2
## sentiment n
## <chr> <int>
## 1 negative 4781
## 2 positive 2005
We can also sort out the most common words while tagged with their associated sentiments.
bing_word_counts <- tidy_books %>%
inner_join(get_sentiments('bing')) %>%
count(word, sentiment, sort=TRUE) %>%
ungroup()
bing_word_counts
## # A tibble: 2,492 × 3
## word sentiment n
## <chr> <chr> <int>
## 1 great positive 1226
## 2 well positive 855
## 3 like positive 813
## 4 good positive 487
## 5 doubt negative 414
## 6 wild negative 317
## 7 respect positive 310
## 8 remarkable positive 295
## 9 important positive 281
## 10 bright positive 258
## # … with 2,482 more rows
This can be shown visually via piping straight into ggplot2
bing_word_counts %>%
group_by(sentiment) %>%
slice_max(n, n=10) %>% #Similar to binning
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word, fill=sentiment)) +
geom_col(show.legend=FALSE) +
facet_wrap(~sentiment, scale='free_y') +
labs(x='Contribution to Sentiment', y=NULL)
Let's see if we can spot any anomalies in the dataset. We'll remove a couple of words that are potentially being misinterpreted by our lexicon; wild, dark, like, and great might be used in more neutral contexts than would indicate emotion.
custom_stop_words <- bind_rows(tibble(word = c('wild', 'dark', 'great', 'like'), lexicon = c('custom')), stop_words)
custom_stop_words
## # A tibble: 1,153 × 2
## word lexicon
## <chr> <chr>
## 1 wild custom
## 2 dark custom
## 3 great custom
## 4 like custom
## 5 a SMART
## 6 a's SMART
## 7 able SMART
## 8 about SMART
## 9 above SMART
## 10 according SMART
## # … with 1,143 more rows
You could use the above custom_stop_words lexicon to filter out all of those words plus standard stop words to check if texts are still as positive/negative with those potential misinterpretations removed.
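As a sketch, re-running the sentiment counts with the custom list applied might look like this:
tidy_books %>%
  anti_join(custom_stop_words, by = 'word') %>% #drops wild, dark, great, and like along with the standard stop words
  inner_join(get_sentiments('bing'), by = 'word') %>%
  count(word, sentiment, sort = TRUE)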
Next, let’s look at Word Clouds.
We can see that tidy text mining and sentiment analysis work well with ggplot2, but having our data in tidy format also lends itself to other nice graphical techniques.
library(wordcloud)
tidy_books %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words=100))
Let’s also look at comparison.cloud(), which may require turning the dataframe into a matrix. We can do so using the acast() function.
library(reshape2)
tidy_books %>%
inner_join(get_sentiments('bing')) %>%
count(word, sentiment, sort=TRUE) %>%
acast(word ~ sentiment, value.var='n', fill=0) %>%
comparison.cloud(colors=c('steelblue', 'forestgreen'), max.words = 100)
As you can see, comparison.cloud() gave us a new layer of information by adding colors to define which words are positive and which words are negative. This can not only show us the most common words, but also compare volume of positive to negative words used.
Looking at Units Beyond Words
Lots of useful work can be done by tokenizing at the word level, but sometimes it’s nice to look at different units of text. For example, we can look beyond just unigrams.
Example: I am not having a good day.
Given the presence of "not" and "good", this could be marked as a neutral sentence since "not" would count as negative and "good" would count as positive, even though the overall sentiment is clearly negative. Other processing may be better at finding the appropriate sentiment; one option is sketched below.
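One option is to tokenize at the sentence level and treat whole sentences as the unit of analysis; a rough sketch, reusing the Darwin download from earlier, looks like this:
darwin_sentences <- darwin %>%
  unnest_tokens(sentence, text, token='sentences') #one row per sentence instead of per word
head(darwin_sentences$sentence, 3)
From there, each sentence could be scored as a unit. Below, we instead stay at the word level but aggregate by chapter, asking which chapter of each book has the highest proportion of negative words.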
bingnegative <- get_sentiments('bing') %>%
filter(sentiment == 'negative')
wordcounts <- tidy_books %>%
group_by(book, chapter) %>%
summarize(words = n())
tidy_books %>%
semi_join(bingnegative) %>%
group_by(book, chapter) %>%
summarize(negativewords=n()) %>%
left_join(wordcounts, by=c('book', 'chapter')) %>%
mutate(ratio=negativewords/words) %>%
filter(chapter!=0) %>%
slice_max(ratio,n=1) %>%
ungroup()
## # A tibble: 4 × 5
## book chapter negat…¹ words ratio
## <chr> <int> <int> <int> <dbl>
## 1 On the Origin of Species By Means of Natural Sel… 3 5 86 0.0581
## 2 The Descent of Man, and Selection in Relation to… 20 4 87 0.0460
## 3 The Expression of the Emotions in Man and Animals 10 249 4220 0.0590
## 4 The Voyage of the Beagle 10 375 11202 0.0335
## # … with abbreviated variable name ¹negativewords
This shows us the most negative chapters of each book, and their ratios of negativity to all other words.
So far we’ve only looked at single words, but many more accurate analyses are based on the relationship between words, allowing for some amount of context in the analysis.
Let’s look at some methods of tidytext for calculating and visualizing word relationships. We’ll begin by loading books from our previous work with word analyses.
library(dplyr)
library(tidytext)
library(gutenbergr)
library(ggplot2)
darwin_books <- gutenberg_download(c(944, 1227, 1228, 2300), mirror = 'https://mirror2.sandyriver.net/pub/gutenberg')
colnames(darwin_books)[1] <- 'book'
darwin_books$book[darwin_books$book == 944] <- 'The Voyage of the Beagle'
darwin_books$book[darwin_books$book == 1227] <- 'The Expression of the Emotions in Man and Animals'
darwin_books$book[darwin_books$book == 1228] <- 'On the Origin of Species By Means of Natural Selection'
darwin_books$book[darwin_books$book == 2300] <- 'The Descent of Man, and Selection in Relation to Sex'
darwin_bigrams <- darwin_books %>%
unnest_tokens(bigram, text, token='ngrams', n=2)
darwin_bigrams
## # A tibble: 724,531 × 2
## book bigram
## <chr> <chr>
## 1 The Voyage of the Beagle the voyage
## 2 The Voyage of the Beagle voyage of
## 3 The Voyage of the Beagle of the
## 4 The Voyage of the Beagle the beagle
## 5 The Voyage of the Beagle beagle by
## 6 The Voyage of the Beagle charles darwin
## 7 The Voyage of the Beagle <NA>
## 8 The Voyage of the Beagle <NA>
## 9 The Voyage of the Beagle <NA>
## 10 The Voyage of the Beagle <NA>
## # … with 724,521 more rows
This data is still in tidy, one-token-per-row format; each token is now a bigram, an overlapping two-word combination.
Counting and filtering n-grams (where n can be any number):
darwin_bigrams %>%
count(bigram, sort=TRUE)
## # A tibble: 238,516 × 2
## bigram n
## <chr> <int>
## 1 of the 11297
## 2 <NA> 8947
## 3 in the 5257
## 4 on the 4093
## 5 to the 2849
## 6 the same 2048
## 7 that the 1947
## 8 it is 1830
## 9 of a 1610
## 10 and the 1590
## # … with 238,506 more rows
Most of the common bigrams are stop words. This is a good time to use tidyr's separate() function, which splits a column into multiple columns based on a delimiter. This will let us make one column for the first word and another for the second.
library(tidyr)
bigrams_separated <- darwin_bigrams %>%
separate(bigram, c('word1', 'word2'), sep=' ') #This function is effectively saying "look at bigram, separate the columns into word1 and word2, separating the actual data by the space between them"
bigrams_filtered <- bigrams_separated %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word) #Filter out all word1/word2 entries that are found in the stop_word list on the word column
bigrams_separated
## # A tibble: 724,531 × 3
## book word1 word2
## <chr> <chr> <chr>
## 1 The Voyage of the Beagle the voyage
## 2 The Voyage of the Beagle voyage of
## 3 The Voyage of the Beagle of the
## 4 The Voyage of the Beagle the beagle
## 5 The Voyage of the Beagle beagle by
## 6 The Voyage of the Beagle charles darwin
## 7 The Voyage of the Beagle <NA> <NA>
## 8 The Voyage of the Beagle <NA> <NA>
## 9 The Voyage of the Beagle <NA> <NA>
## 10 The Voyage of the Beagle <NA> <NA>
## # … with 724,521 more rows
bigrams_filtered
## # A tibble: 94,896 × 3
## book word1 word2
## <chr> <chr> <chr>
## 1 The Voyage of the Beagle charles darwin
## 2 The Voyage of the Beagle <NA> <NA>
## 3 The Voyage of the Beagle <NA> <NA>
## 4 The Voyage of the Beagle <NA> <NA>
## 5 The Voyage of the Beagle <NA> <NA>
## 6 The Voyage of the Beagle <NA> <NA>
## 7 The Voyage of the Beagle online edition
## 8 The Voyage of the Beagle <NA> <NA>
## 9 The Voyage of the Beagle degree symbol
## 10 The Voyage of the Beagle degs italics
## # … with 94,886 more rows
New bigram counts:
bigram_counts <- bigrams_filtered %>%
unite(bigram, word1, word2, sep=' ')
bigram_counts
## # A tibble: 94,896 × 2
## book bigram
## <chr> <chr>
## 1 The Voyage of the Beagle charles darwin
## 2 The Voyage of the Beagle NA NA
## 3 The Voyage of the Beagle NA NA
## 4 The Voyage of the Beagle NA NA
## 5 The Voyage of the Beagle NA NA
## 6 The Voyage of the Beagle NA NA
## 7 The Voyage of the Beagle online edition
## 8 The Voyage of the Beagle NA NA
## 9 The Voyage of the Beagle degree symbol
## 10 The Voyage of the Beagle degs italics
## # … with 94,886 more rows
We may also be interested in trigrams (three-word combos). Let’s set those up.
darwin_trigrams <- darwin_books %>%
unnest_tokens(trigram, text, token='ngrams', n=3) %>%
separate(trigram, c('word1','word2','word3'), sep=' ') %>%
filter(!word1 %in% stop_words$word,
!word2 %in% stop_words$word,
!word3 %in% stop_words$word) %>%
count(word1, word2, word3, sort=TRUE)
darwin_trigrams
## # A tibble: 19,971 × 4
## word1 word2 word3 n
## <chr> <chr> <chr> <int>
## 1 <NA> <NA> <NA> 9884
## 2 tierra del fuego 92
## 3 secondary sexual characters 91
## 4 captain fitz roy 45
## 5 closely allied species 30
## 6 de la physionomie 30
## 7 domestication vol ii 26
## 8 vol ii pp 22
## 9 vertebrates vol iii 21
## 10 proc zoolog soc 18
## # … with 19,961 more rows
Let’s actually analyze some of our bigrams from earlier.
bigrams_filtered %>%
filter(word2 == 'selection') %>%
count(book, word1, sort=TRUE)
## # A tibble: 39 × 3
## book word1 n
## <chr> <chr> <int>
## 1 The Descent of Man, and Selection in Relation to Sex sexual 254
## 2 On the Origin of Species By Means of Natural Selection natural 250
## 3 The Descent of Man, and Selection in Relation to Sex natural 156
## 4 On the Origin of Species By Means of Natural Selection sexual 18
## 5 On the Origin of Species By Means of Natural Selection continued 6
## 6 The Descent of Man, and Selection in Relation to Sex unconscious 6
## 7 On the Origin of Species By Means of Natural Selection methodical 5
## 8 The Descent of Man, and Selection in Relation to Sex continued 5
## 9 On the Origin of Species By Means of Natural Selection unconscious 4
## 10 The Expression of the Emotions in Man and Animals natural 4
## # … with 29 more rows
#Here we're looking at what single word1 values are most commonly paired with the word 'selection' for each book
Let's again look at tf-idf, this time across bigrams throughout Darwin's works; recall that idf captures how rare or common a term is relative to the other works (a higher value means the term is more distinctive of that book).
bigram_tf_idf <- bigram_counts %>%
count(book, bigram) %>%
bind_tf_idf(bigram, book, n) %>%
arrange(desc(tf_idf))
bigram_tf_idf
## # A tibble: 60,595 × 6
## book bigram n tf idf tf_idf
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 The Expression of the Emotions in Man and… nerve… 47 0.00350 1.39 0.00485
## 2 On the Origin of Species By Means of Natu… natur… 250 0.0160 0.288 0.00460
## 3 The Expression of the Emotions in Man and… la ph… 35 0.00260 1.39 0.00361
## 4 The Voyage of the Beagle bueno… 54 0.00245 1.39 0.00339
## 5 The Voyage of the Beagle capta… 53 0.00240 1.39 0.00333
## 6 On the Origin of Species By Means of Natu… glaci… 36 0.00230 1.39 0.00319
## 7 The Voyage of the Beagle fitz … 50 0.00227 1.39 0.00314
## 8 The Expression of the Emotions in Man and… muscl… 30 0.00223 1.39 0.00310
## 9 The Expression of the Emotions in Man and… orbic… 29 0.00216 1.39 0.00299
## 10 The Expression of the Emotions in Man and… dr du… 26 0.00194 1.39 0.00268
## # … with 60,585 more rows
Now let’s graph those relationships, separating by book:
bigram_tf_idf %>%
arrange(desc(tf_idf)) %>%
group_by(book) %>%
slice_max(tf_idf, n=10) %>%
ungroup() %>%
mutate(bigram=reorder(bigram, tf_idf)) %>%
ggplot(aes(tf_idf, bigram, fill=book)) +
geom_col(show.legend=FALSE) +
facet_wrap(~book, ncol=2, scales='free') +
labs(x='tf-idf of bigrams', y=NULL)
Using bigrams to provide context in sentiment analysis
bigrams_separated %>%
filter(word1 == 'not') %>%
count(word1, word2, sort=TRUE)
## # A tibble: 867 × 3
## word1 word2 n
## <chr> <chr> <int>
## 1 not be 128
## 2 not have 104
## 3 not only 103
## 4 not a 100
## 5 not to 98
## 6 not been 89
## 7 not the 82
## 8 not at 70
## 9 not know 60
## 10 not so 58
## # … with 857 more rows
By doing sentiment analysis on bigrams, we can examine how often sentiment-associated words are preceded by a modifier like 'not' or other negating words. Here are the sentiment values of words in the AFINN lexicon:
afinn <- get_sentiments('afinn')
afinn
## # A tibble: 2,477 × 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # … with 2,467 more rows
We can examine the most frequent words that were preceded by “not”, and associate with sentiment.
not_words <- bigrams_separated %>%
filter(word1 == 'not') %>%
inner_join(afinn, by=c(word2='word')) %>%
count(word2, value, sort=TRUE)
not_words
## # A tibble: 114 × 3
## word2 value n
## <chr> <dbl> <int>
## 1 doubt -1 25
## 2 like 2 11
## 3 pretend -1 9
## 4 wish 1 8
## 5 admit -1 7
## 6 difficult -1 5
## 7 easy 1 5
## 8 reach 1 5
## 9 extend 1 4
## 10 forget -1 4
## # … with 104 more rows
Now we can see that “not like” has a positive value, but should likely be counted as negative in-context. The 11 occurrences of this may skew our analysis.
Let’s visualize:
not_words %>%
mutate(contribution=n*value) %>%
arrange(desc(abs(contribution))) %>%
head(20) %>%
mutate(word2=reorder(word2, contribution)) %>%
ggplot(aes(n*value, word2, fill=n*value > 0)) +
geom_col(show.legend=FALSE) +
labs(x='Sentiment Value * Number of Occurrences', y='Words Preceded by \"Not\"')
So as we can see, ‘like’ scores very high in its sentiment value effect despite being preceded by ‘not’. This gives us a good impression of what positive words are likely skewing our data in the positive direction due to a lack of bigram context.
negation_words <- c('not', 'no', 'never', 'non', 'without')
negated_words <- bigrams_separated %>% #This whole thing will pull out all word combos using our negation words
filter(word1 %in% negation_words) %>%
inner_join(afinn, by=c(word2='word')) %>%
count(word1, word2, value, sort=TRUE)
negated_words
## # A tibble: 208 × 4
## word1 word2 value n
## <chr> <chr> <dbl> <int>
## 1 no doubt -1 210
## 2 not doubt -1 25
## 3 no great 3 19
## 4 not like 2 11
## 5 not pretend -1 9
## 6 not wish 1 8
## 7 without doubt -1 8
## 8 not admit -1 7
## 9 no greater 3 6
## 10 not difficult -1 5
## # … with 198 more rows
Let’s visualize our negation words
negated_words %>%
mutate(contribution = n*value, word2=reorder(paste(word2, word1, sep='_'), contribution)) %>%
group_by(word1) %>%
slice_max(abs(contribution), n=12, with_ties=FALSE) %>%
ggplot(aes(word2, contribution, fill=n*value>0)) +
geom_col(show.legend=FALSE) +
facet_wrap(~word1, scales='free') +
scale_x_discrete(labels=function(x) gsub('_.+$', '', x)) +
xlab('Words Preceded by Negation Term') +
ylab('Sentiment Value * Number of Occurrences') +
coord_flip()
Let’s visualize a network of bigrams using ggraph:
library(igraph)
bigram_counts2 <- bigrams_filtered %>%
count(word1, word2, sort=TRUE)
bigram_graph <- bigram_counts2 %>%
filter(n>20) %>%
graph_from_data_frame()
bigram_graph
## IGRAPH 2b7850a DN-- 203 140 --
## + attr: name (v/c), n (e/n)
## + edges from 2b7850a (vertex names):
## [1] NA ->NA natural ->selection sexual ->selection
## [4] vol ->ii lower ->animals sexual ->differences
## [7] south ->america distinct ->species secondary ->sexual
## [10] breeding ->season closely ->allied sexual ->characters
## [13] tierra ->del del ->fuego vol ->iii
## [16] de ->la natural ->history fresh ->water
## [19] north ->america bright ->colours sexual ->difference
## [22] allied ->species tail ->feathers strongly ->marked
## + ... omitted several edges
library(ggraph)
set.seed(1234)
ggraph(bigram_graph, layout='fr') +
geom_edge_link() +
geom_node_point() +
geom_node_text(aes(label=name), vjust=1, hjust=1)
We can also add directionality to this network
set.seed(1234)
a <- grid::arrow(type='closed', length=unit(0.15, 'inches'))
ggraph(bigram_graph, layout='fr') +
geom_edge_link(aes(edge_alpha=n), show.legend=FALSE, arrow=a, end_cap=circle(0.07, 'inches')) +
geom_node_point(color='lightblue', size=3) +
geom_node_text(aes(label=name), vjust=1, hjust=1) +
theme_void()
As you can see, the new graph gives us arrows that show the directionality of these word relationships, with darker arrows indicating more frequent bigrams.
A central question in text mining is how to quantify what a document is about. We can do this by looking at words that make up the document and measuring term frequency.
There are a lot of words that may not be important; these are the stop words we're already familiar with. One way to remedy this is to look at a word's inverse document frequency (idf), which decreases the weight of commonly used words and increases the weight of words that are rare across a collection of documents. Multiplying a word's term frequency by its idf gives tf-idf, a measure of how important a word is to one document within the collection.
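Concretely, for a term appearing in a collection of documents, idf = ln(total number of documents / number of documents containing the term), and tf-idf is that value multiplied by the term's frequency within a given document. The numbers below are purely illustrative:
n_documents <- 4 #e.g., four books in a collection
n_containing <- 2 #hypothetical: the term appears in two of them
idf <- log(n_documents / n_containing) #natural log, matching bind_tf_idf()
tf <- 0.002 #hypothetical frequency of the term within one document
tf * idf #the term's tf-idf for that document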
Let’s look at term frequency from Darwin’s works
library(dplyr)
library(tidytext)
library(gutenbergr)
book_words <- gutenberg_download(c(944, 1227, 1228, 2300), mirror = 'https://mirror2.sandyriver.net/pub/gutenberg')
colnames(book_words)[1] <- 'book'
book_words$book[book_words$book == 944] <- "The Voyage of the Beagle"
book_words$book[book_words$book == 1227] <- "The Expression of the Emotions in Man and Animals"
book_words$book[book_words$book == 1228] <- "On the Origin of Species By Means of Natural Selection"
book_words$book[book_words$book == 2300] <- "The Descent of Man, and Selection in Relation to Sex"
Now let's dissect our text
book_words <- book_words %>%
unnest_tokens(word, text) %>%
count(book, word, sort=TRUE)
book_words
## # A tibble: 43,024 × 3
## book word n
## <chr> <chr> <int>
## 1 The Descent of Man, and Selection in Relation to Sex the 25490
## 2 The Voyage of the Beagle the 16930
## 3 The Descent of Man, and Selection in Relation to Sex of 16762
## 4 On the Origin of Species By Means of Natural Selection the 10301
## 5 The Voyage of the Beagle of 9438
## 6 The Descent of Man, and Selection in Relation to Sex in 8882
## 7 The Expression of the Emotions in Man and Animals the 8045
## 8 On the Origin of Species By Means of Natural Selection of 7864
## 9 The Descent of Man, and Selection in Relation to Sex and 7854
## 10 The Descent of Man, and Selection in Relation to Sex to 5901
## # … with 43,014 more rows
As you can see, these raw counts of common words tell us very little on their own. Let's also get the total word count for each book.
book_words$n <- as.numeric(book_words$n)
total_words <- book_words %>%
group_by(book) %>%
summarize(total = sum(n))
book_words
## # A tibble: 43,024 × 3
## book word n
## <chr> <chr> <dbl>
## 1 The Descent of Man, and Selection in Relation to Sex the 25490
## 2 The Voyage of the Beagle the 16930
## 3 The Descent of Man, and Selection in Relation to Sex of 16762
## 4 On the Origin of Species By Means of Natural Selection the 10301
## 5 The Voyage of the Beagle of 9438
## 6 The Descent of Man, and Selection in Relation to Sex in 8882
## 7 The Expression of the Emotions in Man and Animals the 8045
## 8 On the Origin of Species By Means of Natural Selection of 7864
## 9 The Descent of Man, and Selection in Relation to Sex and 7854
## 10 The Descent of Man, and Selection in Relation to Sex to 5901
## # … with 43,014 more rows
Now let’s join our book_words and totals.
book_words <- left_join(book_words, total_words)
book_words
## # A tibble: 43,024 × 4
## book word n total
## <chr> <chr> <dbl> <dbl>
## 1 The Descent of Man, and Selection in Relation to Sex the 25490 311041
## 2 The Voyage of the Beagle the 16930 208118
## 3 The Descent of Man, and Selection in Relation to Sex of 16762 311041
## 4 On the Origin of Species By Means of Natural Selection the 10301 157002
## 5 The Voyage of the Beagle of 9438 208118
## 6 The Descent of Man, and Selection in Relation to Sex in 8882 311041
## 7 The Expression of the Emotions in Man and Animals the 8045 110414
## 8 On the Origin of Species By Means of Natural Selection of 7864 157002
## 9 The Descent of Man, and Selection in Relation to Sex and 7854 311041
## 10 The Descent of Man, and Selection in Relation to Sex to 5901 311041
## # … with 43,014 more rows
As you can see, the usual suspects are the most common words and still don’t tell us anything.
Let’s try to fix that.
library(ggplot2)
ggplot(book_words, aes(n/total, fill=book)) +
geom_histogram(show.legend=FALSE) +
xlim(NA, 0.0009) +
facet_wrap(~book, ncol=2, scales='free_y')
Zipf's law: the frequency with which a word appears is inversely proportional to its rank. Let's create a dataset that assigns a rank to each word so we can examine this relationship.
freq_by_rank <- book_words %>%
group_by(book) %>%
mutate(rank=row_number(),
'term frequency'=n/total) %>%
ungroup()
freq_by_rank
## # A tibble: 43,024 × 6
## book word n total rank term …¹
## <chr> <chr> <dbl> <dbl> <int> <dbl>
## 1 The Descent of Man, and Selection in Relati… the 25490 311041 1 0.0820
## 2 The Voyage of the Beagle the 16930 208118 1 0.0813
## 3 The Descent of Man, and Selection in Relati… of 16762 311041 2 0.0539
## 4 On the Origin of Species By Means of Natura… the 10301 157002 1 0.0656
## 5 The Voyage of the Beagle of 9438 208118 2 0.0453
## 6 The Descent of Man, and Selection in Relati… in 8882 311041 3 0.0286
## 7 The Expression of the Emotions in Man and A… the 8045 110414 1 0.0729
## 8 On the Origin of Species By Means of Natura… of 7864 157002 2 0.0501
## 9 The Descent of Man, and Selection in Relati… and 7854 311041 4 0.0253
## 10 The Descent of Man, and Selection in Relati… to 5901 311041 5 0.0190
## # … with 43,014 more rows, and abbreviated variable name ¹`term frequency`
Now let's plot it properly:
freq_by_rank %>%
ggplot(aes(rank, `term frequency`, color=book)) +
geom_line(size=1.1, alpha=0.8, show.legend=FALSE) +
scale_x_log10() +
scale_y_log10()
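The roughly straight line on the log-log scale is what Zipf's law predicts. As a sketch, we can estimate the exponent by fitting a linear model over a middle range of ranks (the cutoffs here are arbitrary):
rank_subset <- freq_by_rank %>%
  filter(rank < 500, rank > 10) #ignore the extreme ends of the rank range
lm(log10(`term frequency`) ~ log10(rank), data = rank_subset) #a slope near -1 is consistent with Zipf's law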
Let's use tf-idf to find characteristic words for each document by decreasing the weight of commonly used words and increasing the weight of words used less often across the collection; the bind_tf_idf() function handles this for us.
book_tf_idf <- book_words %>%
bind_tf_idf(word, book, n)
book_tf_idf
## # A tibble: 43,024 × 7
## book word n total tf idf tf_idf
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 The Descent of Man, and Selection in … the 25490 311041 0.0820 0 0
## 2 The Voyage of the Beagle the 16930 208118 0.0813 0 0
## 3 The Descent of Man, and Selection in … of 16762 311041 0.0539 0 0
## 4 On the Origin of Species By Means of … the 10301 157002 0.0656 0 0
## 5 The Voyage of the Beagle of 9438 208118 0.0453 0 0
## 6 The Descent of Man, and Selection in … in 8882 311041 0.0286 0 0
## 7 The Expression of the Emotions in Man… the 8045 110414 0.0729 0 0
## 8 On the Origin of Species By Means of … of 7864 157002 0.0501 0 0
## 9 The Descent of Man, and Selection in … and 7854 311041 0.0253 0 0
## 10 The Descent of Man, and Selection in … to 5901 311041 0.0190 0 0
## # … with 43,014 more rows
Now let’s look at specifically terms with a high tf-idf.
book_tf_idf %>%
select(-total) %>%
arrange(desc(tf_idf))
## # A tibble: 43,024 × 6
## book word n tf idf tf_idf
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 The Expression of the Emotions in Man and … tears 126 1.14e-3 1.39 1.58e-3
## 2 The Expression of the Emotions in Man and … blush 114 1.03e-3 1.39 1.43e-3
## 3 The Expression of the Emotions in Man and … eyeb… 149 1.35e-3 0.693 9.35e-4
## 4 The Voyage of the Beagle degs 117 5.62e-4 1.39 7.79e-4
## 5 On the Origin of Species By Means of Natur… sele… 412 2.62e-3 0.288 7.55e-4
## 6 The Descent of Man, and Selection in Relat… sexu… 745 2.40e-3 0.288 6.89e-4
## 7 The Descent of Man, and Selection in Relat… shewn 143 4.60e-4 1.39 6.37e-4
## 8 On the Origin of Species By Means of Natur… hybr… 133 8.47e-4 0.693 5.87e-4
## 9 The Expression of the Emotions in Man and … frown 46 4.17e-4 1.39 5.78e-4
## 10 The Descent of Man, and Selection in Relat… sele… 621 2.00e-3 0.288 5.74e-4
## # … with 43,014 more rows
Let’s look at a visualization for these high tf-idf words.
library(forcats)
book_tf_idf %>%
group_by(book) %>%
slice_max(tf_idf, n=15) %>%
ungroup() %>%
ggplot(aes(tf_idf, fct_reorder(word, tf_idf), fill=book)) +
geom_col(show.legend=FALSE) +
facet_wrap(~book, ncol=2, scales='free') +
labs(x='tf-idf',y=NULL)
Text mining as outlined in the previous section has many common uses in today's social media landscape. Companies, political parties, and others use text mining alongside APIs to measure public opinion on current issues, their brand or identity, and more. Knowing how to use these APIs and analyze the data pulled from them is a marketable skill in more ways than one.
Downloading tweets through the Twitter API requires a sign-up process that grants you a token, which you then use to pull tweets from specified date ranges. Below is a full example of how you might pull data from Twitter, but all of the code is commented out, since the token will likely have expired by the time this document is read. For the next section, pre-downloaded files will be used instead.
#library(academictwitteR) #This package allows R to interface easily with the Twitter API
#bearer_token <- "AAAAAAAAAAAAAAAAAAAAAByFhgEAAAAAl4Ql0CyArwygcl0FewUO5ARSaAQ%3DvJpyDIOBKBVdtgnEv7wrpdwoeI$wbQLbxfSYElUGZuwix0kbQm"
#tweets20US <-
# get_all_tweets(
# 'covid-19 has:geo', #This part requires any pulled tweets to include geolocation data
# '2020-06-01T01:00:002', #This part specifies when to begin pulling tweets
# '2020-06-08T01:00:002', #This part specifies when to stop pulling tweets
# bearer_token, #This is the token that we stored from above
# n=5000, #This is the number of tweets to pull
# country='US' #And here is the country we want to pull from
# )
#tweets21US <-
# get_all_tweets(
# 'covid-19 has:geo',
# '2021-01-01T01:00:002',
# '2021-01-08T01:00:002',
# bearer_token,
# n=5000,
# country='US'
# )
#tweets22US <-
# get_all_tweets(
# 'covid-19 has:geo',
# '2022-01-01T01:00:002',
# '2022-01-08T01:00:002',
# bearer_token,
# n=5000,
# country='US'
# )
In the next section, we’ll look at data from the above year ranges, as well as data from Great Britain and Canada.
First things first, we need to load our data & required libraries:
library(lubridate)
library(ggplot2)
library(dplyr)
library(readr)
library(tidytext)
library(stringr)
library(tidyr)
library(textdata)
library(reshape2)
library(scales)
tweets20CA <- read.csv('~/School/Fall 2022/Bioinf/R Files/Twitter API Text Mining/tweets20CA.csv', row.names=1)
tweets21CA <- read.csv('~/School/Fall 2022/Bioinf/R Files/Twitter API Text Mining/tweets21CA.csv', row.names=1)
tweets22CA <- read.csv('~/School/Fall 2022/Bioinf/R Files/Twitter API Text Mining/tweets22CA.csv', row.names=1)
tweets20GB <- read.csv('~/School/Fall 2022/Bioinf/R Files/Twitter API Text Mining/tweets20GB.csv', row.names=1)
tweets21GB <- read.csv('~/School/Fall 2022/Bioinf/R Files/Twitter API Text Mining/tweets21GB.csv', row.names=1)
tweets22GB <- read.csv('~/School/Fall 2022/Bioinf/R Files/Twitter API Text Mining/tweets22GB.csv', row.names=1)
tweets20US <- read.csv('~/School/Fall 2022/Bioinf/R Files/Twitter API Text Mining/tweets20us.csv', row.names=1)
tweets21US <- read.csv('~/School/Fall 2022/Bioinf/R Files/Twitter API Text Mining/tweets21us.csv', row.names=1)
tweets22US <- read.csv('~/School/Fall 2022/Bioinf/R Files/Twitter API Text Mining/tweets22us.csv', row.names=1)
Next we’ll combine our data for each country, sorting by year:
tweetsUS <- bind_rows(tweets21US %>%
mutate(year='y2021'),
tweets22US %>%
mutate(year='y2022'),
tweets20US %>%
mutate(year='y2020')) %>%
mutate(timestamp=ymd_hms(created_at))
tweetsGB <- bind_rows(tweets21GB %>%
mutate(year='y2021'),
tweets22GB %>%
mutate(year='y2022'),
tweets20GB %>%
mutate(year='y2020')) %>%
mutate(timestamp=ymd_hms(created_at))
tweetsCA <- bind_rows(tweets21CA %>%
mutate(year='y2021'),
tweets22CA %>%
mutate(year='y2022'),
tweets20CA %>%
mutate(year='y2020')) %>%
mutate(timestamp=ymd_hms(created_at))
And now we’ll begin cleaning our data. We’ll remove a bunch of messy annotation artifacts, as well as stop words to keep only the sentiment-relevant components.
remove_reg <- '&|<|>'
tidy_tweetsUS <- tweetsUS %>%
filter(!str_detect(text, '^RT')) %>%
mutate(text=str_remove_all(text, remove_reg)) %>%
unnest_tokens(word, text, token='tweets') %>%
filter(!word %in% stop_words$word,
!word %in% str_remove_all(stop_words$word, "'"),
str_detect(word, '[a-z]'))
tidy_tweetsGB <- tweetsGB %>%
filter(!str_detect(text, '^RT')) %>%
mutate(text=str_remove_all(text, remove_reg)) %>%
unnest_tokens(word, text, token='tweets') %>%
filter(!word %in% stop_words$word,
!word %in% str_remove_all(stop_words$word, "'"),
str_detect(word, '[a-z]'))
tidy_tweetsCA <- tweetsCA %>%
filter(!str_detect(text, '^RT')) %>%
mutate(text=str_remove_all(text, remove_reg)) %>%
unnest_tokens(word, text, token='tweets') %>%
filter(!word %in% stop_words$word,
!word %in% str_remove_all(stop_words$word, "'"),
str_detect(word, '[a-z]'))
Now let's visualize the volume of tweets over time using histograms.
ggplot(tweetsUS, aes(x=timestamp, fill=year)) +
geom_histogram(position='identity', bins=20, show.legend=FALSE) +
facet_wrap(~year, ncol=1)
ggplot(tweetsGB, aes(x=timestamp, fill=year)) +
geom_histogram(position='identity', bins=20, show.legend=FALSE) +
facet_wrap(~year, ncol=1)
ggplot(tweetsCA, aes(x=timestamp, fill=year)) +
geom_histogram(position='identity', bins=20, show.legend=FALSE) +
facet_wrap(~year, ncol=1)
Let’s view the word frequencies for our countries.
frequencyUS <- tidy_tweetsUS %>%
count(year, word, sort=TRUE) %>%
left_join(tidy_tweetsUS %>%
count(year, name='total')) %>%
mutate(freq=n/total)
## Joining, by = "year"
frequencyGB <- tidy_tweetsGB %>%
count(year, word, sort=TRUE) %>%
left_join(tidy_tweetsGB %>%
count(year, name='total')) %>%
mutate(freq=n/total)
## Joining, by = "year"
frequencyCA <- tidy_tweetsCA %>%
count(year, word, sort=TRUE) %>%
left_join(tidy_tweetsCA %>%
count(year, name='total')) %>%
mutate(freq=n/total)
## Joining, by = "year"
head(frequencyUS)
## year word n total freq
## 1 y2021 covid 4431 66977 0.066157039
## 2 y2020 covid 4339 67422 0.064355848
## 3 y2022 covid 3157 57060 0.055327725
## 4 y2022 #covid19 894 57060 0.015667718
## 5 y2021 vaccine 662 66977 0.009883990
## 6 y2021 people 458 66977 0.006838168
head(frequencyGB)
## year word n total freq
## 1 y2020 covid 3757 71255 0.052726124
## 2 y2021 covid 3274 53897 0.060745496
## 3 y2020 #covid19 889 71255 0.012476317
## 4 y2022 covid 782 17402 0.044937364
## 5 y2020 people 556 71255 0.007802961
## 6 y2021 #covid19 442 53897 0.008200828
head(frequencyCA)
## year word n total freq
## 1 y2020 covid 2419 48348 0.05003309
## 2 y2021 covid 1702 31882 0.05338435
## 3 y2022 covid 788 17302 0.04554387
## 4 y2020 stigma 559 48348 0.01156201
## 5 y2020 health 438 48348 0.00905932
## 6 y2020 #covid19 435 48348 0.00899727
Let’s now get the frequency by year for each word.
frequencyUS2 <- frequencyUS %>%
select(year, word, freq) %>%
pivot_wider(names_from=year, values_from=freq) %>%
arrange('y2020', 'y2021', 'y2022')
frequencyGB2 <- frequencyGB %>%
select(year, word, freq) %>%
pivot_wider(names_from=year, values_from=freq) %>%
arrange('y2020', 'y2021', 'y2022')
frequencyCA2 <- frequencyCA %>%
select(year, word, freq) %>%
pivot_wider(names_from=year, values_from=freq) %>%
arrange('y2020', 'y2021', 'y2022')
head(frequencyUS2)
## # A tibble: 6 × 4
## word y2021 y2020 y2022
## <chr> <dbl> <dbl> <dbl>
## 1 covid 0.0662 0.0644 0.0553
## 2 #covid19 0.00527 0.00577 0.0157
## 3 vaccine 0.00988 0.000519 0.00435
## 4 people 0.00684 0.00662 0.00605
## 5 testing 0.00148 0.00191 0.00619
## 6 de 0.00337 0.00393 0.00547
head(frequencyGB2)
## # A tibble: 6 × 4
## word y2020 y2021 y2022
## <chr> <dbl> <dbl> <dbl>
## 1 covid 0.0527 0.0607 0.0449
## 2 #covid19 0.0125 0.00820 0.0220
## 3 people 0.00780 0.00759 0.00684
## 4 vaccine 0.000463 0.00727 0.00471
## 5 uk 0.00547 0.00638 0.00356
## 6 vaccination 0.0000140 0.00505 0.00218
head(frequencyCA2)
## # A tibble: 6 × 4
## word y2020 y2021 y2022
## <chr> <dbl> <dbl> <dbl>
## 1 covid 0.0500 0.0534 0.0455
## 2 stigma 0.0116 0.00982 0.000462
## 3 health 0.00906 0.0127 0.00520
## 4 #covid19 0.00900 0.00489 0.0145
## 5 fighting 0.00819 0.00910 0.000520
## 6 de 0.00780 0.00794 0.00647
Let's visualize these frequencies by word. Note that you could make this more readable by reducing the number of words being analyzed, as sketched after the plot below.
ggplot(frequencyUS2, aes(y2020, y2022)) +
geom_jitter(alpha=0.05, size=2.5, width=0.25, height=0.25) +
geom_text(aes(label=word), check_overlap=TRUE, vjust=1.5) +
scale_x_log10(labels=percent_format()) +
scale_y_log10(labels=percent_format()) +
geom_abline(color='forestgreen')
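For example, here is a sketch that keeps only words making up at least 0.05% of the 2020 tweets (an arbitrary cutoff) before plotting:
frequencyUS2 %>%
  filter(y2020 > 0.0005) %>% #arbitrary threshold to thin out rare words
  ggplot(aes(y2020, y2022)) +
  geom_jitter(alpha=0.25, size=2.5, width=0.1, height=0.1) +
  geom_text(aes(label=word), check_overlap=TRUE, vjust=1.5) +
  scale_x_log10(labels=percent_format()) +
  scale_y_log10(labels=percent_format()) +
  geom_abline(color='forestgreen')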
Let’s now try filtering our US data by nrc’s “joy” coded words.
nrc_joy <- get_sentiments('nrc') %>%
filter(sentiment == 'joy')
tidy_tweetsUS %>%
filter(year=='y2020') %>%
inner_join(nrc_joy) %>%
count(word, sort=TRUE)
## Joining, by = "word"
## word n
## 1 safe 93
## 2 money 64
## 3 hope 56
## 4 food 50
## 5 god 45
## 6 love 43
## 7 vote 41
## 8 share 36
## 9 happy 32
## 10 true 32
## 11 football 27
## 12 found 27
## 13 peaceful 27
## 14 pay 26
## 15 resources 26
## 16 church 25
## 17 friend 25
## 18 rising 23
## 19 proud 21
## 20 mother 20
## 21 excited 19
## 22 feeling 19
## 23 respect 19
## 24 beach 18
## 25 graduation 18
## 26 treat 18
## 27 beautiful 17
## 28 finally 17
## 29 pray 17
## 30 pretty 17
## 31 deal 16
## 32 fun 16
## 33 hero 16
## 34 special 16
## 35 birthday 15
## 36 favorite 15
## 37 providing 14
## 38 alive 13
## 39 create 13
## 40 income 13
## 41 star 13
## 42 success 13
## 43 baby 12
## 44 enjoy 12
## 45 peace 12
## 46 youth 12
## 47 bless 11
## 48 closure 11
## 49 excellent 11
## 50 glad 11
## 51 luck 11
## 52 save 11
## 53 sun 11
## 54 wonderful 11
## 55 beer 10
## 56 pride 10
## 57 cash 9
## 58 celebration 9
## 59 ceremony 9
## 60 child 9
## 61 jump 9
## 62 perfect 9
## 63 praise 9
## 64 advance 8
## 65 clean 8
## 66 encourage 8
## 67 thankful 8
## 68 award 7
## 69 faith 7
## 70 freedom 7
## 71 grant 7
## 72 humanity 7
## 73 daughter 6
## 74 friendly 6
## 75 garden 6
## 76 healing 6
## 77 joy 6
## 78 labor 6
## 79 progress 6
## 80 art 5
## 81 basketball 5
## 82 celebrating 5
## 83 content 5
## 84 dance 5
## 85 equality 5
## 86 generous 5
## 87 green 5
## 88 helpful 5
## 89 improve 5
## 90 laughing 5
## 91 loving 5
## 92 miracle 5
## 93 music 5
## 94 saint 5
## 95 supremacy 5
## 96 thriving 5
## 97 vacation 5
## 98 wealth 5
## 99 advocacy 4
## 100 birth 4
## 101 blessed 4
## 102 celebrated 4
## 103 charity 4
## 104 delicious 4
## 105 gain 4
## 106 heal 4
## 107 heartfelt 4
## 108 laugh 4
## 109 majority 4
## 110 organization 4
## 111 pleased 4
## 112 powerful 4
## 113 promise 4
## 114 successful 4
## 115 surprise 4
## 116 winning 4
## 117 appreciation 3
## 118 beauty 3
## 119 blessing 3
## 120 bonus 3
## 121 cathedral 3
## 122 childhood 3
## 123 comfort 3
## 124 endless 3
## 125 festival 3
## 126 freely 3
## 127 grow 3
## 128 inauguration 3
## 129 parade 3
## 130 retirement 3
## 131 score 3
## 132 smiling 3
## 133 spa 3
## 134 sweet 3
## 135 swim 3
## 136 unexpected 3
## 137 visionary 3
## 138 visitor 3
## 139 volunteer 3
## 140 weight 3
## 141 accomplished 2
## 142 approve 2
## 143 candid 2
## 144 cheering 2
## 145 closeness 2
## 146 cream 2
## 147 elite 2
## 148 engaged 2
## 149 enjoying 2
## 150 entertainment 2
## 151 exciting 2
## 152 fitting 2
## 153 glory 2
## 154 gorgeous 2
## 155 grandchildren 2
## 156 holiday 2
## 157 hopeful 2
## 158 hug 2
## 159 improvement 2
## 160 inspired 2
## 161 intimately 2
## 162 invite 2
## 163 journey 2
## 164 joyous 2
## 165 kudos 2
## 166 liquor 2
## 167 lucky 2
## 168 marvelous 2
## 169 passionate 2
## 170 practiced 2
## 171 ribbon 2
## 172 salute 2
## 173 sex 2
## 174 shopping 2
## 175 spirits 2
## 176 succeed 2
## 177 treasure 2
## 178 victory 2
## 179 wages 2
## 180 welcomed 2
## 181 winner 2
## 182 abundance 1
## 183 accomplish 1
## 184 adoration 1
## 185 amen 1
## 186 amusement 1
## 187 banquet 1
## 188 blessings 1
## 189 bloom 1
## 190 bridal 1
## 191 brilliant 1
## 192 buddy 1
## 193 chant 1
## 194 charitable 1
## 195 chocolate 1
## 196 clown 1
## 197 communion 1
## 198 compliment 1
## 199 confidence 1
## 200 confident 1
## 201 dawn 1
## 202 embrace 1
## 203 enlighten 1
## 204 entertaining 1
## 205 erotic 1
## 206 exceed 1
## 207 fancy 1
## 208 gift 1
## 209 giggle 1
## 210 glimmer 1
## 211 gratitude 1
## 212 happiness 1
## 213 healthful 1
## 214 heroic 1
## 215 heyday 1
## 216 hilarious 1
## 217 hire 1
## 218 honest 1
## 219 humanitarian 1
## 220 independence 1
## 221 inspiration 1
## 222 inspire 1
## 223 intelligence 1
## 224 kiss 1
## 225 liberate 1
## 226 liberty 1
## 227 loyal 1
## 228 luxury 1
## 229 majestic 1
## 230 marriage 1
## 231 marry 1
## 232 massage 1
## 233 medal 1
## 234 ministry 1
## 235 musical 1
## 236 notable 1
## 237 opera 1
## 238 orchestra 1
## 239 outstanding 1
## 240 passion 1
## 241 pastor 1
## 242 pleasant 1
## 243 praised 1
## 244 precious 1
## 245 privileged 1
## 246 readiness 1
## 247 receiving 1
## 248 recreation 1
## 249 rejoice 1
## 250 repay 1
## 251 reproductive 1
## 252 rescue 1
## 253 revel 1
## 254 reward 1
## 255 romance 1
## 256 salary 1
## 257 satisfied 1
## 258 scholarship 1
## 259 shining 1
## 260 silly 1
## 261 smile 1
## 262 spouse 1
## 263 sunny 1
## 264 sunshine 1
## 265 superman 1
## 266 supporter 1
## 267 sweets 1
## 268 symphony 1
## 269 teach 1
## 270 tender 1
## 271 tranquility 1
## 272 trophy 1
## 273 undying 1
## 274 vivid 1
## 275 wonderfully 1
## 276 worship 1
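Before moving on, here is a minimal sketch (reusing the nrc_joy object from the code above) that plots just the ten most frequent joy words from 2020 as a barplot, which can be easier to scan than the full table:
tidy_tweetsUS %>%
filter(year=='y2020') %>%
inner_join(nrc_joy) %>%
count(word, sort=TRUE) %>%
slice_max(n, n=10) %>% # keep the ten most frequent joy words (ties included)
mutate(word=reorder(word, n)) %>% # order the bars by count
ggplot(aes(n, word)) +
geom_col(fill='darkblue') +
labs(x='Count', y=NULL)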
We can do the same with “anger” coded words. This time we’ll use our data from Britain.
nrc_anger <- get_sentiments('nrc') %>%
filter(sentiment=='anger')
tidy_tweetsGB %>%
filter(year=='y2022') %>%
inner_join(nrc_anger) %>%
count(word, sort=TRUE)
## Joining, by = "word"
## word n
## 1 death 26
## 2 bad 18
## 3 jab 15
## 4 disease 12
## 5 feeling 9
## 6 hit 8
## 7 cancer 7
## 8 deserve 7
## 9 dying 7
## 10 killing 7
## 11 money 7
## 12 bloody 6
## 13 fight 6
## 14 poverty 6
## 15 smell 6
## 16 deadly 5
## 17 fear 5
## 18 losing 5
## 19 politics 5
## 20 shit 5
## 21 shot 5
## 22 treat 5
## 23 words 5
## 24 attack 4
## 25 awful 4
## 26 blame 4
## 27 court 4
## 28 disruption 4
## 29 fighting 4
## 30 lying 4
## 31 mad 4
## 32 mortality 4
## 33 opposed 4
## 34 resign 4
## 35 threat 4
## 36 adverse 3
## 37 broken 3
## 38 clash 3
## 39 damage 3
## 40 hate 3
## 41 insignificant 3
## 42 loss 3
## 43 opposition 3
## 44 shock 3
## 45 anxiety 2
## 46 argument 2
## 47 assault 2
## 48 chaos 2
## 49 combat 2
## 50 cutting 2
## 51 delay 2
## 52 deny 2
## 53 disagree 2
## 54 disaster 2
## 55 enforce 2
## 56 harry 2
## 57 horrible 2
## 58 kick 2
## 59 misleading 2
## 60 morbidity 2
## 61 react 2
## 62 shortage 2
## 63 shout 2
## 64 sore 2
## 65 storm 2
## 66 victim 2
## 67 youth 2
## 68 abandoned 1
## 69 abuse 1
## 70 accused 1
## 71 aggression 1
## 72 armed 1
## 73 arrogant 1
## 74 bear 1
## 75 bitch 1
## 76 campaigning 1
## 77 caution 1
## 78 celebrity 1
## 79 challenge 1
## 80 communism 1
## 81 confinement 1
## 82 confusion 1
## 83 contempt 1
## 84 crazy 1
## 85 criminal 1
## 86 defy 1
## 87 demand 1
## 88 despair 1
## 89 diabolical 1
## 90 disgrace 1
## 91 disgraceful 1
## 92 disinformation 1
## 93 displaced 1
## 94 disrespect 1
## 95 epidemic 1
## 96 escalate 1
## 97 evil 1
## 98 exile 1
## 99 failing 1
## 100 fee 1
## 101 fraud 1
## 102 frustrated 1
## 103 grab 1
## 104 grim 1
## 105 guilty 1
## 106 hot 1
## 107 insanity 1
## 108 insult 1
## 109 invasion 1
## 110 limited 1
## 111 lonely 1
## 112 lose 1
## 113 lunatic 1
## 114 madness 1
## 115 manslaughter 1
## 116 misery 1
## 117 mocking 1
## 118 morals 1
## 119 mug 1
## 120 offense 1
## 121 offensive 1
## 122 persecution 1
## 123 pretending 1
## 124 prick 1
## 125 prison 1
## 126 psychosis 1
## 127 punch 1
## 128 rage 1
## 129 rail 1
## 130 rating 1
## 131 resentment 1
## 132 retract 1
## 133 ridicule 1
## 134 sarcasm 1
## 135 savage 1
## 136 sentence 1
## 137 socialist 1
## 138 strike 1
## 139 terrible 1
## 140 thief 1
## 141 traitor 1
## 142 violent 1
Lastly, let’s try “fear” coded words for Canada.
nrc_fear <- get_sentiments('nrc') %>%
filter(sentiment == 'fear')
tidy_tweetsCA %>%
filter(year=='y2020') %>%
inner_join(nrc_fear) %>%
count(word, sort=TRUE)
## Joining, by = "word"
## word n
## 1 stigma 559
## 2 pandemic 228
## 3 death 87
## 4 medical 77
## 5 emergency 65
## 6 government 64
## 7 hospital 52
## 8 risk 49
## 9 fight 44
## 10 change 37
## 11 disease 27
## 12 fear 26
## 13 spike 21
## 14 force 19
## 15 prevent 19
## 16 watch 18
## 17 die 16
## 18 discrimination 16
## 19 avoid 15
## 20 bad 15
## 21 infection 15
## 22 quarantine 15
## 23 surveillance 15
## 24 police 14
## 25 remains 14
## 26 challenge 13
## 27 delay 13
## 28 youth 13
## 29 god 12
## 30 illness 11
## 31 deadly 10
## 32 flu 10
## 33 confinement 9
## 34 crazy 9
## 35 dying 9
## 36 infectious 9
## 37 concerned 8
## 38 difficult 8
## 39 epidemic 8
## 40 feeling 8
## 41 inequality 8
## 42 lose 8
## 43 prison 8
## 44 warn 8
## 45 worry 8
## 46 afraid 7
## 47 caution 7
## 48 dangerous 7
## 49 disaster 7
## 50 poverty 7
## 51 warning 7
## 52 worse 7
## 53 abuse 6
## 54 forced 6
## 55 lines 6
## 56 treat 6
## 57 war 6
## 58 cancer 5
## 59 court 5
## 60 danger 5
## 61 failure 5
## 62 fire 5
## 63 hearing 5
## 64 struggle 5
## 65 surprise 5
## 66 violence 5
## 67 warden 5
## 68 abandon 4
## 69 advance 4
## 70 alarm 4
## 71 combat 4
## 72 confidence 4
## 73 contagion 4
## 74 contagious 4
## 75 disabled 4
## 76 doubt 4
## 77 enforce 4
## 78 guard 4
## 79 khan 4
## 80 loss 4
## 81 military 4
## 82 murder 4
## 83 nervous 4
## 84 retirement 4
## 85 rule 4
## 86 unsafe 4
## 87 urgent 4
## 88 warned 4
## 89 worship 4
## 90 anxiety 3
## 91 bear 3
## 92 brutality 3
## 93 burke 3
## 94 compassion 3
## 95 confusion 3
## 96 conspiracy 3
## 97 cross 3
## 98 diagnosis 3
## 99 dragon 3
## 100 examination 3
## 101 graduation 3
## 102 hate 3
## 103 illegal 3
## 104 insecurity 3
## 105 lawyer 3
## 106 missing 3
## 107 painful 3
## 108 polio 3
## 109 stroke 3
## 110 suffering 3
## 111 suspension 3
## 112 terrible 3
## 113 volunteer 3
## 114 accused 2
## 115 apparition 2
## 116 assault 2
## 117 beware 2
## 118 bitch 2
## 119 blues 2
## 120 broke 2
## 121 broken 2
## 122 cash 2
## 123 chaos 2
## 124 contaminated 2
## 125 cruel 2
## 126 crushed 2
## 127 dementia 2
## 128 destroying 2
## 129 dictatorship 2
## 130 disinformation 2
## 131 disruption 2
## 132 failing 2
## 133 fearful 2
## 134 fever 2
## 135 flood 2
## 136 fragile 2
## 137 gore 2
## 138 hanging 2
## 139 hunting 2
## 140 injury 2
## 141 intelligence 2
## 142 journey 2
## 143 kill 2
## 144 killing 2
## 145 loyal 2
## 146 manifestation 2
## 147 mortality 2
## 148 pain 2
## 149 plea 2
## 150 procedure 2
## 151 punitive 2
## 152 remove 2
## 153 shame 2
## 154 shoot 2
## 155 shooting 2
## 156 shot 2
## 157 shrink 2
## 158 suicide 2
## 159 suspect 2
## 160 threaten 2
## 161 tragedy 2
## 162 unexpected 2
## 163 unknown 2
## 164 unruly 2
## 165 wary 2
## 166 weight 2
## 167 abandoned 1
## 168 accident 1
## 169 aftermath 1
## 170 agonizing 1
## 171 alarming 1
## 172 ambulance 1
## 173 anarchy 1
## 174 armed 1
## 175 attack 1
## 176 attorney 1
## 177 autopsy 1
## 178 avoiding 1
## 179 awful 1
## 180 bacteria 1
## 181 beating 1
## 182 bigoted 1
## 183 birth 1
## 184 bloody 1
## 185 bombardment 1
## 186 brutal 1
## 187 buck 1
## 188 bully 1
## 189 burial 1
## 190 cabal 1
## 191 cautious 1
## 192 cemetery 1
## 193 cliff 1
## 194 concealed 1
## 195 conflict 1
## 196 coward 1
## 197 crash 1
## 198 criminal 1
## 199 cyclone 1
## 200 dart 1
## 201 default 1
## 202 defend 1
## 203 defense 1
## 204 deleterious 1
## 205 demise 1
## 206 dentistry 1
## 207 destination 1
## 208 destroyed 1
## 209 deterioration 1
## 210 devastating 1
## 211 dire 1
## 212 disappear 1
## 213 disastrous 1
## 214 discipline 1
## 215 discourage 1
## 216 disgusting 1
## 217 dismissal 1
## 218 distressed 1
## 219 divorce 1
## 220 dominant 1
## 221 endless 1
## 222 erase 1
## 223 exorcism 1
## 224 explosive 1
## 225 fatality 1
## 226 fled 1
## 227 flying 1
## 228 forgotten 1
## 229 gang 1
## 230 giant 1
## 231 grave 1
## 232 gun 1
## 233 harm 1
## 234 hide 1
## 235 hiding 1
## 236 holocaust 1
## 237 homeless 1
## 238 honest 1
## 239 horrible 1
## 240 horrific 1
## 241 horrors 1
## 242 hostage 1
## 243 hurt 1
## 244 immigrant 1
## 245 incarceration 1
## 246 inmate 1
## 247 insane 1
## 248 insanity 1
## 249 insecure 1
## 250 jail 1
## 251 lawsuit 1
## 252 leprosy 1
## 253 loom 1
## 254 lynch 1
## 255 mad 1
## 256 madness 1
## 257 mage 1
## 258 manipulation 1
## 259 measles 1
## 260 mob 1
## 261 mortgage 1
## 262 newcomer 1
## 263 operation 1
## 264 opponent 1
## 265 opposed 1
## 266 ordnance 1
## 267 panic 1
## 268 paranoia 1
## 269 pare 1
## 270 penalty 1
## 271 plight 1
## 272 pneumonia 1
## 273 poisoned 1
## 274 powerful 1
## 275 pray 1
## 276 punish 1
## 277 punishing 1
## 278 rat 1
## 279 ravine 1
## 280 recession 1
## 281 reckless 1
## 282 regulatory 1
## 283 rejects 1
## 284 reluctant 1
## 285 resign 1
## 286 riot 1
## 287 sag 1
## 288 scandal 1
## 289 scare 1
## 290 screaming 1
## 291 sentence 1
## 292 sickening 1
## 293 sin 1
## 294 sinister 1
## 295 slam 1
## 296 slave 1
## 297 stint 1
## 298 supremacy 1
## 299 surgery 1
## 300 sweat 1
## 301 terrorist 1
## 302 theft 1
## 303 threat 1
## 304 threatening 1
## 305 troll 1
## 306 turmoil 1
## 307 uncertain 1
## 308 unemployed 1
## 309 uphill 1
## 310 vanished 1
## 311 venom 1
## 312 vigilant 1
## 313 violent 1
## 314 worrying 1
## 315 wrath 1
## 316 xenophobia 1
Next, let’s use the bing sentiment lexicon to get an idea of how sentiment changed between 2020, 2021, and 2022 in COVID tweets. We’ll graph the results as a histogram faceted by year.
GB_covid_sentiment <- tidy_tweetsGB %>%
inner_join(get_sentiments('bing')) %>%
count(year, index=id %/% 80, sentiment) %>%
pivot_wider(names_from=sentiment, values_from=n, values_fill=0) %>%
mutate(sentiment=positive-negative)
ggplot(GB_covid_sentiment, aes(sentiment, fill=year)) +
geom_histogram(show.legend=FALSE) +
facet_wrap(~year, ncol=2, scales='free_x')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
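That stat_bin() message is just ggplot2 noting that geom_histogram() defaulted to 30 bins. If we want to silence it, a minimal sketch would set binwidth explicitly (the value 5 here is an arbitrary illustrative choice, not part of the original analysis):
ggplot(GB_covid_sentiment, aes(sentiment, fill=year)) +
geom_histogram(binwidth=5, show.legend=FALSE) + # binwidth=5 is an arbitrary illustrative choice
facet_wrap(~year, ncol=2, scales='free_x')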
Let’s compute the mean sentiment score for each year from these data:
GB_covid_sentiment_means <- aggregate(x=GB_covid_sentiment$sentiment, by=list(GB_covid_sentiment$year), FUN=mean)
GB_covid_sentiment_means
## Group.1 x
## 1 y2020 -0.5423032
## 2 y2021 -0.3636364
## 3 y2022 -0.4652062
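Since we’ve been leaning on dplyr elsewhere, the same yearly means can also be computed with group_by() and summarize(); here’s a minimal sketch assuming GB_covid_sentiment is still in memory:
GB_covid_sentiment %>%
group_by(year) %>% # one group per year
summarize(mean_sentiment=mean(sentiment)) # mean of the net (positive - negative) scores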
Now we’ll repeat everything we did with the GB data above for our US and CA data.
US_covid_sentiment <- tidy_tweetsUS %>%
inner_join(get_sentiments('bing')) %>%
count(year, index=id %/% 80, sentiment) %>%
pivot_wider(names_from=sentiment, values_from=n, values_fill=0) %>%
mutate(sentiment=positive-negative)
ggplot(US_covid_sentiment, aes(sentiment, fill=year)) +
geom_histogram(show.legend=FALSE) +
facet_wrap(~year, ncol=2, scales='free_x')
US_covid_sentiment_means <- aggregate(x=US_covid_sentiment$sentiment, by=list(US_covid_sentiment$year), FUN=mean)
US_covid_sentiment_means
## Group.1 x
## 1 y2020 -0.6611008
## 2 y2021 -0.6183871
## 3 y2022 -0.5228113
CA_covid_sentiment <- tidy_tweetsCA %>%
inner_join(get_sentiments('bing')) %>%
count(year, index=id %/% 80, sentiment) %>%
pivot_wider(names_from=sentiment, values_from=n, values_fill=0) %>%
mutate(sentiment=positive-negative)
ggplot(CA_covid_sentiment, aes(sentiment, fill=year)) +
geom_histogram(show.legend=FALSE) +
facet_wrap(~year, ncol=2, scales='free_x')
CA_covid_sentiment_means <- aggregate(x=CA_covid_sentiment$sentiment, by=list(CA_covid_sentiment$year), FUN=mean)
CA_covid_sentiment_means
## Group.1 x
## 1 y2020 -0.6140260
## 2 y2021 -0.6931007
## 3 y2022 -0.3765244
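To compare the three countries directly, one option is to bind the three means data frames together with a country label and plot them as a dodged barplot. This is a sketch built on the *_covid_sentiment_means objects above; aggregate() named their columns Group.1 and x, so we rename them first:
covid_means <- bind_rows(
mutate(GB_covid_sentiment_means, country='GB'),
mutate(US_covid_sentiment_means, country='US'),
mutate(CA_covid_sentiment_means, country='CA')
) %>%
rename(year=Group.1, mean_sentiment=x) # aggregate() named these columns Group.1 and x
ggplot(covid_means, aes(x=year, y=mean_sentiment, fill=country)) +
geom_col(position=position_dodge(0.8), width=0.7) +
labs(x='Year', y='Mean bing sentiment')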
Using bing, we can examine individual positive/negative attributions to check for major outliers. For example, trump is listed as a positive word in bing, but in these tweets it almost certainly refers to the person rather than to any sentiment. Likewise, positive is treated as a positive word, when here it often refers to a positive COVID test, a decidedly negative subject. A custom list of stop words can filter out these skewing words.
bing_word_counts_US <- tidy_tweetsUS %>%
inner_join(get_sentiments('bing')) %>%
count(word, sentiment, sort=TRUE) %>%
ungroup()
head(bing_word_counts_US)
## word sentiment n
## 1 positive positive 613
## 2 trump positive 468
## 3 virus negative 425
## 4 death negative 306
## 5 died negative 297
## 6 safe positive 241
Here’s a graphical representation of our data:
bing_word_counts_US %>%
group_by(sentiment) %>%
slice_max(n, n=10) %>%
ungroup() %>%
mutate(word=reorder(word, n)) %>%
ggplot(aes(n, word, fill=sentiment)) +
geom_col(show.legend=FALSE) +
facet_wrap(~sentiment, scales='free_y') +
labs(x='Contribution to Sentiment', y=NULL)
Let’s now make a custom stop-word list to remove those skewing words from our sentiment data:
custom_stopwords <- bind_rows(tibble(word=c('trump','positive'),
lexicon=c('custom')),
stop_words)
custom_stopwords
## # A tibble: 1,151 × 2
## word lexicon
## <chr> <chr>
## 1 trump custom
## 2 positive custom
## 3 a SMART
## 4 a's SMART
## 5 able SMART
## 6 about SMART
## 7 above SMART
## 8 according SMART
## 9 accordingly SMART
## 10 across SMART
## # … with 1,141 more rows
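As a quick sanity check (a minimal sketch assuming the objects above are still in memory), we can re-run the bing word counts after removing the custom stop words and confirm that trump and positive no longer top the list:
tidy_tweetsUS %>%
anti_join(custom_stopwords) %>% # drops 'trump', 'positive', and the standard stop words
inner_join(get_sentiments('bing')) %>%
count(word, sentiment, sort=TRUE) %>%
head()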
Next, let’s look into word clouds. First, we’ll graph a simple word cloud in which larger words denote more common words. Then, we’ll do the same, but split the cloud by sentiment so that positive and negative words appear in different colors.
library(wordcloud)
tidy_tweetsUS %>%
anti_join(custom_stopwords) %>%
count(word) %>%
with(wordcloud(word, n, max.words=75))
library(reshape2) # provides acast() for casting the counts into a word-by-sentiment matrix
tidy_tweetsUS %>%
anti_join(custom_stopwords) %>%
inner_join(get_sentiments('bing')) %>%
count(word, sentiment, sort=TRUE) %>%
acast(word~sentiment, value.var='n', fill=0) %>%
comparison.cloud(colors=c('forestgreen','magenta','cyan'), max.words=70)
Finally, let’s count how many negative words appear in each year compared to the total number of words, and compute the ratio of negative words. Because the years have different word counts, this ratio better illustrates the differences in negativity between them.
bingnegative <- get_sentiments('bing') %>%
filter(sentiment=='negative')
wordcounts <- tidy_tweetsCA %>%
group_by(year) %>%
summarize(words=n())
tidy_tweetsCA %>%
semi_join(bingnegative) %>%
group_by(year) %>%
summarize(negativewords=n()) %>%
left_join(wordcounts, by='year') %>%
mutate(ratio=negativewords/words) %>%
slice_max(ratio, n=3) %>%
ungroup()
## # A tibble: 3 × 4
## year negativewords words ratio
## <chr> <int> <int> <dbl>
## 1 y2020 2711 48348 0.0561
## 2 y2021 1715 31882 0.0538
## 3 y2022 753 17302 0.0435
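If we’d rather present these ratios graphically than as a table, a minimal sketch (assuming the bingnegative and wordcounts objects from above) pipes the same summary straight into geom_col():
tidy_tweetsCA %>%
semi_join(bingnegative) %>%
group_by(year) %>%
summarize(negativewords=n()) %>%
left_join(wordcounts, by='year') %>%
mutate(ratio=negativewords/words) %>%
ggplot(aes(x=year, y=ratio)) +
geom_col(fill='darkblue') + # one bar per year
labs(x='Year', y='Proportion of negative words')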