This tutorial series will introduce you to computational tools, tricks, and tips that you will find useful in conducting quantitative social science research. I develop the tutorial for COMM621 taught at UMass-Amherst. If you are an instructor or a student of quantitative research methods, feel free to adapt it for your class.

If you want to learn R for mining, analyzing, and visualizing social media data, here is [another set of tutorials] I’ve developed (https://curiositybits.shinyapps.io/R_social_data_analytics/#section-data-frames).

#Samples and sampling distribution

If you recall from reading Charles Wheelan’s Naked Statistics, the magic of inferential statistics lies in the central limit theorem (The Lebron James of statistics as Wheelan would call it). Based on the theorem, a larger and properly drawn sample will approximate the population data we want to make inference about.

Below we demonstrate the central limit theorem using part of the open data provided by LA Police Department. The dataset contains all arrests made by LAPD in 2016. For simplicity, I included only columns that are relevant to the demo. Read here for a description of variables.

library(DT)
library(readr)
library(ggplot2)
library(pastecs)
library(sampling)
library(gmodels)
library(data.table)

arrests <- read.csv("https://curiositybits.cc/files/la_arrests_2016.csv")

#show the first 500 arrests
datatable(arrests[1:500,], options = list(pageLength = 5))

Before drawing samples, take a look at the distribution of age (the age of those arrested) in the population data. We run the following code to get a set of descriptive stats, including mean, median, standard deviations, etc. Because the dataset includes all of the arrests in 2016, we can treat it as population data. The population mean, as noted in the output below, is 35.55.

#the age info is stored in the AGE column
age <- arrests$AGE 
stat.desc(age, basic = F) #show the descriptive stats of age

##       median         mean      SE.mean CI.mean.0.95          var 
##   32.0000000   34.9274404    0.0894575    0.1753430  177.8987791 
##      std.dev     coef.var 
##   13.3378701    0.3818737

qplot(age, geom="histogram") #make a histogram

We will construct a sample of 100 from age to see how the sample mean approximates the population mean.

sample1 <- sample(age, 100)
summary(sample1)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15.00   24.75   31.00   35.24   46.00   80.00

qplot(sample1, geom="histogram")

Based on the histogram of sample1, the age values are quite dispersed, with notable outliers. Next, we construct a sample of 500 to see if the new sample mean moves closer to the population mean.

sample2 <- sample(age, 500)
summary(sample2)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00   25.00   32.00   35.47   46.00   85.00

qplot(sample2, geom="histogram")

Try a bigger sample. This time we draw 5000 cases.

sample3 <- sample(age, 5000)
summary(sample3)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   24.00   32.00   34.95   45.00   85.00

qplot(sample3, geom="histogram")

The histogram of sample3 appears more similar to how the age values are distributed in the population data.

Based on the central limit theorem, if we draw large enough number of samples, the distribution of sample means will approximate the normal distribution (the bell curve), and the larger the sample size, the tighter that distribution will be.

Let’s demonstrate how that plays out in our data. Below we draw 5000 samples with each sample containing 1000 cases. We plot the sample means in the histogram shown below.

sample_means <- rep(NA, 5000)

for(i in 1:5000){
  samp <- sample(age, 1000)
  sample_means[i] <- mean(samp)
}
qplot(sample_means, geom="histogram")

summary(sample_means)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   33.55   34.65   34.93   34.93   35.20   36.46

#Sampling designs

To be able to arrive at generalizable conclusions about the population, we must use properly constructed samples. Below introduces several commonly used sampling designs.

In our case, we have access to the population data (all arrests made in 2016 in LA), so it is fairly easy to produce a random sample. In the code below, we create a sample of 500 cases, without replacement.

randomsample <- arrests[sample(1:nrow(arrests), 500, replace=FALSE),] 
knitr::kable(head(randomsample, 5)) #show the first 5 cases

	AREA_DESC	AGE	SEX_CD	DESCENT_CD	GRP_DESC	ARST_TYP_CD	CHARGE	CHRG_DESC	LOCATION
18806	Topanga	30	F	H	NA	M	490.2PC	PETTY THEFT	19700 VANOWEN ST
7367	Hollywood	54	F	W	Disorderly Conduct	M	41.18DLAMC	SIT/LIE/SLEEP SIDEWALK OR STREET	1100 N WESTERN AV
21910	Wilshire	43	M	B	Larceny	M	484(A)PC	GRAND THEFT (OVER $400)	7200 MELROSE AV
13252	Olympic	73	M	B	Miscellaneous Other Violations	M	853.7PC	FTA AFTER WRITTEN PROMISE	VERMONT
17997	Southwest	34	M	B	Drunkeness	M	41.27(C)LAM	DRINKING IN PUBLIC***	NORMANDIE

Take a look at the descriptive stats of the random sample and see how cases are distributed by sex and geographic areas.

mean(randomsample$AGE) #show mean

## [1] 34.576

sd(randomsample$AGE) #show standard deviation

## [1] 13.71702

table(randomsample$SEX_CD) #show frequency count by sex

## 
##   F   M 
## 103 397

table(randomsample$AREA_DESC) #show frequency count by geographic areas

## 
## 77th Street     Central  Devonshire    Foothill      Harbor  Hollenbeck 
##          27          70          15          14           8          13 
##   Hollywood     Mission N Hollywood      Newton   Northeast     Olympic 
##          48          25          24          20          12          14 
##     Pacific     Rampart   Southeast   Southwest     Topanga    Van Nuys 
##          40          14          24          37          19          35 
##     West LA West Valley    Wilshire 
##          13          14          14

As you learn from the above sample and the population, the majority of the arrested are male. Also, it is not far-fetched to say that crimes are probably concentrated in certain areas. But, what if we disregard the role of gender and locality and want to generate a sample with equal representation from both sexes and all LA areas?

Below we create a stratified sample based on two separate strata (sex and area). The sex vairable is called SEX_CD and geographic area info is stored in AREA_DESC.

arrests <- data.table(arrests)
setkey(arrests, AREA_DESC, SEX_CD)

#display # of cases in each strata
knitr::kable(arrests[, .N, keyby = list(AREA_DESC, SEX_CD)])

AREA_DESC	SEX_CD	N
77th Street	F	321
77th Street	M	949
Central	F	716
Central	M	2308
Devonshire	F	170
Devonshire	M	446
Foothill	F	129
Foothill	M	569
Harbor	F	140
Harbor	M	542
Hollenbeck	F	103
Hollenbeck	M	585
Hollywood	F	397
Hollywood	M	1718
Mission	F	263
Mission	M	848
N Hollywood	F	254
N Hollywood	M	801
Newton	F	166
Newton	M	888
Northeast	F	123
Northeast	M	536
Olympic	F	130
Olympic	M	699
Pacific	F	346
Pacific	M	1622
Rampart	F	160
Rampart	M	762
Southeast	F	203
Southeast	M	582
Southwest	F	236
Southwest	M	1033
Topanga	F	178
Topanga	M	482
Van Nuys	F	315
Van Nuys	M	861
West LA	F	96
West LA	M	370
West Valley	F	156
West Valley	M	471
Wilshire	F	105
Wilshire	M	451

You will note that there are 42 rows in the above output, meaning there are 42 strata. Run the code below to randomly select 15 cases from each strata. The parameter “srswr” means “simple random sampling with replacement,” which is the method used to select units. Run ?strata() for the help document.

set.seed(2)
ss <- data.table(strata(arrests, c("AREA_DESC", "SEX_CD"), rep(15, 42), "srswr"))
datatable(ss, options = list(pageLength = 5))

Cluster sampling is another commonly used design. The code below demonstrates a simple case of cluster sampling: we first sample 5 areas and then randomly select cases from the 5 areas.

cs <- cluster(arrests, clustername=c("AREA_DESC"),size=5,method="srswor")

#see which areas are selected in the sample and the number of cases in each area
cs <- data.table(cs)
knitr::kable(cs[, .N, keyby = list(AREA_DESC)])

AREA_DESC	N
Hollenbeck	688
N Hollywood	1055
Newton	1054
Pacific	1968
Rampart	922

#Is your statistical significance substantive?

A great myth in social science is p-value. It is not an intuitive concept to wrap your head around. Even many scientists struggle to explain it in laymen’s terms, as shown in the video below.

What is a p-value?

We use p-value to indicate statistical significance. The magic number here is .05: a p-value below .05 indicates a statistically significant relationship between variables. It indicates there is a 95% chance that the relationship found in the study sample is NOT due to random chances, and thus the finding from the sample accurately reflects the patterns in the population.

But, even that is an over-extrapolation. Read this article by FiveThirtyEight.

This is how the article explains p-value:

So what information can you glean from a p-value? The most straightforward explanation I found came from Stuart Buck, vice president of research integrity at the Laura and John Arnold Foundation. Imagine, he said, that you have a coin that you suspect is weighted toward heads. (Your null hypothesis is then that the coin is fair.) You flip it 100 times and get more heads than tails. The p-value won’t tell you whether the coin is fair, but it will tell you the probability that you’d get at least as many heads as you did if the coin was fair. That’s it — nothing more.

Putting p-value to test

We will use a dataset containing the math scores of all students in the New York area. We download the data and name the data frame math. The scores are in the Mean.Scale.Score column and students’ gender is stored in Category.

library(readr)
library(ggplot2)
library(pastecs)
library(sampling)
library(gmodels)
library(dplyr)
library(data.table)

math <- read.csv("https://curiositybits.cc/files/math.csv")

#add a column for row id. We will need the row ids for merging datasets in later steps
math <- mutate(math, ID_unit = rownames(math)) 
math$ID_unit <- as.integer(math$ID_unit)

# Convert the Category column into a categorical variable
math$Category <- as.factor(math$Category)

#show the first 50 observations
datatable(math[1:50,], options = list(pageLength = 5))

Since this dataset can be treated as the population data, we know exactly how female and male students differ in math scores. Run the code below to see the mean difference. Female student do slightly better than male student. But the difference isn’t that big (<2).

aggregate(math$Mean.Scale.Score,by=list(Gender=math$Category), mean)

##   Gender        x
## 1 Female 300.2872
## 2   Male 298.7334

In most research scenarios, we don’t have access to the population data and thus have to resort to sampling. Next, we draw a random sample of 100, compare means across groups, and make inference about the population using this sample. Because this is a by group comparison, we will use a simple t-test.

sample1 <- sample_n(math, 100)
t.test(Mean.Scale.Score ~ Category,data=sample1)

## 
##  Welch Two Sample t-test
## 
## data:  Mean.Scale.Score by Category
## t = -0.57588, df = 97.961, p-value = 0.566
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -10.751783   5.915172
## sample estimates:
## mean in group Female   mean in group Male 
##             296.6383             299.0566

What is the p-value of the above t-test? Is it statistically significant?

If see a p value lower than .05, it means that the test is NOT significant, thus the difference could be due to random change and unlikely reflect the true difference in the population. The 95 percent confidence interval gives you an estimate of a range of possible mean difference. Notice that 0 is inside of the range.

Let run a test using a bigger sample (n=1000). Is the test statistically significant this time?

sample2 <- sample_n(math, 1000)
t.test(Mean.Scale.Score ~ Category,data=sample2)

## 
##  Welch Two Sample t-test
## 
## data:  Mean.Scale.Score by Category
## t = 0.44966, df = 995.83, p-value = 0.6531
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.061520  3.287114
## sample estimates:
## mean in group Female   mean in group Male 
##             300.0813             299.4685

Try a even bigger sample (n=5000)

sample3 <- sample_n(math, 5000)
t.test(Mean.Scale.Score ~ Category,data=sample3)

## 
##  Welch Two Sample t-test
## 
## data:  Mean.Scale.Score by Category
## t = 3.4766, df = 4997.4, p-value = 0.0005122
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.9315152 3.3404871
## sample estimates:
## mean in group Female   mean in group Male 
##             300.2922             298.1562

Very likely, you will see a significant difference in a large sample. A rule of thumb is that all else constant, you will more likely find significant findings when using larger samples.

In our case, with a sample of 5000, even very tiny differences in math scores become detectable. But is the difference meaningful? In other words, does statistical significance mean substantive significance? It does not.

A p-value says nothing about the “strength of association,” or the meaningfulness of difference. Yes, females have better math scores than males, but the difference in the sample and the population is negligible–it will be absurd to talk about a substantive gap in mathematical skills when male and female students are just a few points apart.

The takeaway? Statistical significance ≠ substantive significance.

Some social scientists are accused of engaging in p-hacking to inflate their scholarly contribution. Sometimes, when they have trouble finding significant results, they are tempted to try a bigger sample without thinking about whether a significant, but tiny difference between groups is truly meaningful. There is an entire episode on Last Week Tonight with John Oliver dedicated to uncovering p-hacking.

#Clean survey data

Survey is a widely used social science method for studying public opinion. There are many open datasets from public institutions such as (Pew Research Center)[http://www.pewresearch.org/]. In this tutorial, we will work with a 2016 dataset about how Americans approach facts and information as well as technology use.

You might be surprised to know that social scientists generally spend more time cleaning data than actually running statistical analyses. This is to say that data cleaning is a critical step. Below I will show how to clean survey data from Pew.

The Pew dataset will look like this (see below), with each column representing an unique survey item and rows matching to survey respondents.

a screen capture of the Pew dataset

Responses are already converted into numerical rating scales (e.g., Somewhat interested = 2). You may find other public datasets (such as GSS General Social Survey) that retain answer choices. In such a case, you would need to use the string replace function in R to transform answer choices into numbers. See the example below.


# replace "somewhat interested" with 2 in q1 column of df.
df$q1 <- gsub("Somewhat interested", 2, df$q1)

Back to the Pew dataset, we run the code below to download the data.

library(ggplot2)
library(psych)
library(dplyr)
library(car)
library(ggplot2)

data <- read.csv("https://curiositybits.cc/files/pew_info_engagement.csv")

Suppose we are interested in how race, sex, and political orientation predict trust in information. The outcome variable trust can be measured by the item q6a through q6h. Sex is measured by sex, political orientation by ideo, and race by race3m1.

Step 1: Remove NAs

What we do here is called casewise deletion, that is, excluding all cases that have missing data in at least one of the selected variables. It is a relatively rigorous way of dealing with missing values. But when faced with randomly distributed NAs across variables, researchers can consider using pairwise deletion. The latter is not ideal but acceptable. Read more here.

#clean_data contains only columns of interest
clean_data <- data[,c("sex","ideo", "race3m1", "q5", "q6a", "q6b", "q6c","q6d", "q6e", "q6f", "q6g", "q6h","q3a","q3b","q3c","q3d","q3e","q3f")] 

clean_data <- clean_data %>% na.omit() #remove NAs

Observe how by applying the code above, the valid number of cases drops from 3,015 to 2,051.

Step 2: Exclude inapplicable cases

We also should remove the refused and don’t know answers (Don’t know = 8; Refused=9). We put the data after cleaning into a new data frame (called clean_data). Note that R is a very flexible programming language and there are many ways to do the same trick. You might find a more simple way to remove inapplicable cases.

clean_data <- clean_data[clean_data$sex<=5 & 
                           clean_data$race3m1 <=7 &
                           clean_data$ideo <=5 &
                           clean_data$q5 <=3 &
                           clean_data$q6a <=4 &
                           clean_data$q6b <=4 &
                           clean_data$q6c <=4 &
                           clean_data$q6d <=4 &
                           clean_data$q6e <=4 &
                           clean_data$q6f <=4 &
                           clean_data$q6g <=4 &
                           clean_data$q6h <=4 ,]

Step 3: Convert nominal/categorical variables into factors

R treats nominal/categorical variables as factors. Thus, we use as.factor() to convert variables and plot their distribution.

clean_data$sex <- as.factor(clean_data$sex)
clean_data$race <- as.factor(clean_data$race3m1)
clean_data$ideo <- as.factor(clean_data$ideo)

Step 4: Create a composite index for trust

We can just take the average of the responses to q6a through q6h. We then plot the distribution of the dependent variable by sex.

clean_data$trust <- (clean_data$q6a + clean_data$q6b + clean_data$q6c + clean_data$q6d + clean_data$q6e + clean_data$q6f + clean_data$q6g + clean_data$q6h)/8

#density plot for DV by sex
ggplot(clean_data, aes(clean_data$trust)) + 
  geom_density(aes(data = clean_data$trust, fill = clean_data$sex), position = 'identity', alpha = 0.5) +
  labs(x = 'Trust', y = 'Density') + scale_fill_discrete(name = 'Sex 1=M, 2=F') + scale_x_continuous(limits = c(1, 4))

Now you have a composite variable called trust. But if you look at how answer choices on q6a through q6h are scaled in the original Pew questionaire, you would think that this item should be really called distrust. This is because a higher number on trust indicates a higher degree of distrust. To avoid confusion, it is better to reverse code this variable by doing the following.

clean_data$trust <- reverse.code(-1, clean_data$trust)

Because trust is a multi-item measure, we should test its reliability by using the alpha() function. A reliable measure should achieve a reliability score of .75 or above.

#test the reliability of the DV
trust_items <- select(clean_data, c("q6a","q6b","q6c","q6d","q6e","q6f","q6g","q6h"))
alpha(trust_items)

## 
## Reliability analysis   
## Call: alpha(x = trust_items)
## 
##   raw_alpha std.alpha G6(smc) average_r S/N   ase mean   sd median_r
##       0.65      0.65    0.65      0.19 1.9 0.012  2.2 0.45     0.17
## 
##  lower alpha upper     95% confidence boundaries
## 0.63 0.65 0.68 
## 
##  Reliability if an item is dropped:
##     raw_alpha std.alpha G6(smc) average_r S/N alpha se  var.r med.r
## q6a      0.58      0.58    0.56      0.16 1.4    0.015 0.0070  0.16
## q6b      0.65      0.65    0.64      0.21 1.8    0.012 0.0158  0.20
## q6c      0.66      0.66    0.65      0.22 2.0    0.012 0.0133  0.20
## q6d      0.65      0.64    0.63      0.20 1.8    0.012 0.0162  0.18
## q6e      0.59      0.58    0.57      0.17 1.4    0.014 0.0096  0.15
## q6f      0.58      0.58    0.57      0.17 1.4    0.015 0.0087  0.16
## q6g      0.62      0.62    0.61      0.19 1.6    0.013 0.0167  0.16
## q6h      0.63      0.63    0.62      0.19 1.7    0.013 0.0167  0.18
## 
##  Item statistics 
##        n raw.r std.r r.cor r.drop mean   sd
## q6a 1872  0.66  0.65  0.63   0.49  2.2 0.88
## q6b 1872  0.44  0.45  0.29   0.23  2.9 0.83
## q6c 1872  0.35  0.39  0.21   0.16  2.0 0.70
## q6d 1872  0.50  0.48  0.33   0.27  1.9 0.95
## q6e 1872  0.64  0.64  0.60   0.48  2.1 0.81
## q6f 1872  0.67  0.64  0.60   0.48  2.2 0.93
## q6g 1872  0.53  0.54  0.42   0.35  1.7 0.77
## q6h 1872  0.51  0.52  0.39   0.31  2.2 0.81
## 
## Non missing response frequency for each item
##        1    2    3    4 miss
## q6a 0.18 0.53 0.17 0.12    0
## q6b 0.03 0.31 0.39 0.27    0
## q6c 0.23 0.62 0.12 0.04    0
## q6d 0.43 0.39 0.07 0.11    0
## q6e 0.18 0.59 0.15 0.09    0
## q6f 0.21 0.47 0.18 0.13    0
## q6g 0.43 0.46 0.07 0.04    0
## q6h 0.15 0.58 0.17 0.10    0

The measure of trust has a pretty low reliability score. But that is totally expected because while q6a through q6h are all about trust, they actually address different and sometimes conflicting aspects of trust (e.g., trust in social media vs. trust in financial institution). So, do you still think it is a good idea to lump together eight items to create the trust index?

#T-test

When to use t-test?

T-test is a simple inferential statistical test used for comparing means across groups. It can reveal whether mean differences across different groups are statistically significant. You typically use t-test when you only have one categorical independent variable (e.g., sex) and one dependent variable.

Below we compare whether there is significant gender difference on view of Republican party.

Let’s inspect the data first. The sex variable indicates whether a user is a male or female (1=male, 2=female). The column qa15a points to a survey question Would you say your overall opinion of the Republican Party? (1=is very favorable, 4=very unfavorable).

poli_view <- read.csv("poli_view.csv")

poli_view$sex <- as.factor(poli_view$sex) #treat the column sex as a factor 

#show means across group 
aggregate(poli_view$qa15a,by=list(Sex=poli_view$sex),mean)

##   Sex        x
## 1   1 2.977201
## 2   2 3.083172

Run the t-test.

t.test(qa15a ~ sex,data=poli_view) #run the t-test

## 
##  Welch Two Sample t-test
## 
## data:  qa15a by sex
## t = -1.7917, df = 2219.9, p-value = 0.07331
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.22195461  0.01001283
## sample estimates:
## mean in group 1 mean in group 2 
##        2.977201        3.083172

Is there a gender difference then?

Now, let’s apply t-test to another scenario. Let’s test if there is a significant difference in math score between boys and girls. The data are collected from public schools in New York City between 2013-2017.

math <- read.csv("math.csv")

#treat sex as a categorical variable
math$Category <- as.factor(math$Category)

#show means across group 
aggregate(math$Mean.Scale.Score,by=list(Gender=math$Category),mean)

##   Gender        x
## 1 Female 300.2872
## 2   Male 298.7334

In fact, the data frame math contains data about all pupils in NY. To illustrate how sampling works and how findings may change from sample to sample, let’s draw three random samples.

#sample 100 cases from the original dataset 
library(dplyr)
sample1 <- sample_n(math, 100)
aggregate(sample1$Mean.Scale.Score,by=list(Gender=sample1$Category),mean)

##   Gender        x
## 1 Female 299.0417
## 2   Male 295.8269

Now, draw another sample of 100.

sample2 <- sample_n(math, 100)
aggregate(sample2$Mean.Scale.Score,by=list(Gender=sample2$Category),mean)

##   Gender        x
## 1 Female 299.3036
## 2   Male 297.2727

This time, draw a sample of 2000.

sample3 <- sample_n(math, 2000)
aggregate(sample3$Mean.Scale.Score,by=list(Gender=sample3$Category),mean)

##   Gender        x
## 1 Female 300.2493
## 2   Male 299.0785

Let’s run three separate t-tests on the three samples and compare results.

t.test(Mean.Scale.Score ~ Category,data=sample1)

## 
##  Welch Two Sample t-test
## 
## data:  Mean.Scale.Score by Category
## t = 0.72498, df = 97.021, p-value = 0.4702
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -5.585969 12.015456
## sample estimates:
## mean in group Female   mean in group Male 
##             299.0417             295.8269

t.test(Mean.Scale.Score ~ Category,data=sample2)

## 
##  Welch Two Sample t-test
## 
## data:  Mean.Scale.Score by Category
## t = 0.43128, df = 84.158, p-value = 0.6674
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -7.333009 11.394698
## sample estimates:
## mean in group Female   mean in group Male 
##             299.3036             297.2727

t.test(Mean.Scale.Score ~ Category,data=sample3)

## 
##  Welch Two Sample t-test
## 
## data:  Mean.Scale.Score by Category
## t = 1.1866, df = 1982.3, p-value = 0.2355
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.7642336  3.1057789
## sample estimates:
## mean in group Female   mean in group Male 
##             300.2493             299.0785

#Correlation

When to use correlation?

You can use correlation to test the strength of the relationship between two continuous variables. Note that the two variables must be continuous.

## The Script for Correlation
wage_self_esteem <- read.csv("wage_self_esteem.csv")

#test the correlation between self-esteem and wage reported in 1978
cor.test(wage_self_esteem$SELF.ESTEEM, wage_self_esteem$INCOME, method=c("pearson"))

## 
##  Pearson's product-moment correlation
## 
## data:  wage_self_esteem$SELF.ESTEEM and wage_self_esteem$INCOME
## t = 2.6257, df = 6456, p-value = 0.008667
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.008277817 0.057006039
## sample estimates:
##        cor 
## 0.03266134

#ANOVA

How do sex, race, and ideologies predict trust in social media news?

Now, return to the dataset we previously worked on in the Clean survey data section. Recall that the data frame named clean_data contains four variables of interest: sex (a categorical variable), race (a categorical variable), ideo (a categorical variable), and the dependent variable trust (which is rated on a continuous scale).

If we have only one categorical independent variable, a test-test will suffice. However, in the current case, we have more than one categorical variable, then we would need to use ANOVA. ANOVA is a statistical test used to predict a continuous variable from more than one independent categorical variables.

Before running a ANOVA model, it will be wise to explore key variables to check their distribution.

Explore the statistical distribution of the dependent variable. Is it normally distributed?

#explore the distribution of the DV
hist(clean_data$trust)

#density plot for DV by sex
ggplot(clean_data, aes(clean_data$trust)) + 
  geom_density(aes(data = clean_data$trust, fill = clean_data$sex), position = 'identity', alpha = 0.5) +
  labs(x = 'Trust', y = 'Density') + scale_fill_discrete(name = 'Sex 1=M, 2=F') + scale_x_continuous(limits = c(1, 4))

## Warning: Ignoring unknown aesthetics: data

#density plot for DV by race
ggplot(clean_data, aes(clean_data$trust)) + 
  geom_density(aes(data = clean_data$trust, fill = clean_data$race), position = 'identity', alpha = 0.5) +
  labs(x = 'Trust', y = 'Density') + scale_fill_discrete(name = 'Race') + scale_x_continuous(limits = c(1, 4))

## Warning: Ignoring unknown aesthetics: data

#density plot for DV by ideology
ggplot(clean_data, aes(clean_data$trust)) + 
  geom_density(aes(data = clean_data$trust, fill = clean_data$ideo), position = 'identity', alpha = 0.5) +
  labs(x = 'Trust', y = 'Density') + scale_fill_discrete(name = 'Ideologies') + scale_x_continuous(limits = c(1, 4))

## Warning: Ignoring unknown aesthetics: data

Run a ANOVA model.

#Perform ANOVA
aov.1 <- aov(trust ~ sex+race+ideo, data = clean_data)
summary(aov.1)

##               Df Sum Sq Mean Sq F value   Pr(>F)    
## sex            1    4.7   4.746  23.900 1.10e-06 ***
## race           5    4.4   0.875   4.407 0.000541 ***
## ideo           4    6.5   1.617   8.143 1.66e-06 ***
## Residuals   1861  369.6   0.199                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Can you identify significant variables from the result table above? What does a significant relationship mean in this case?

Now, perform a post-hoc analysis.

TukeyHSD(aov.1)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = trust ~ sex + race + ideo, data = clean_data)
## 
## $sex
##          diff        lwr     upr   p adj
## 2-1 0.1007296 0.06031921 0.14114 1.1e-06
## 
## $race
##            diff         lwr          upr     p adj
## 2-1  0.01952864 -0.07828965  0.117346920 0.9929688
## 3-1  0.20463234  0.05598582  0.353278864 0.0012501
## 5-1 -0.08644492 -0.33797267  0.165082837 0.9242452
## 6-1  0.25060610 -0.48411873  0.985330938 0.9265086
## 7-1  0.10032190 -0.02280546  0.223449249 0.1847181
## 3-2  0.18510370  0.01349277  0.356714642 0.0258269
## 5-2 -0.10597355 -0.37171918  0.159772069 0.8656888
## 6-2  0.23107747 -0.50863539  0.970790322 0.9487084
## 7-2  0.08079326 -0.06925630  0.230842818 0.6409234
## 5-3 -0.29107726 -0.57943140 -0.002723121 0.0463352
## 6-3  0.04597376 -0.70215890  0.794106425 0.9999772
## 7-3 -0.10431044 -0.29150618  0.082885288 0.6055661
## 6-5  0.33705102 -0.43810890  1.112210943 0.8169024
## 7-5  0.18676681 -0.08929960  0.462833229 0.3840410
## 7-6 -0.15028421 -0.89376725  0.593198839 0.9925517
## 
## $ideo
##           diff          lwr       upr     p adj
## 2-1 0.05509343 -0.055382957 0.1655698 0.6523882
## 3-1 0.12342123  0.016519362 0.2303231 0.0142083
## 4-1 0.17257120  0.058076429 0.2870660 0.0003880
## 5-1 0.19195861  0.064317320 0.3195999 0.0004033
## 3-2 0.06832780 -0.004842392 0.1414980 0.0804148
## 4-2 0.11747777  0.033603339 0.2013522 0.0012761
## 5-2 0.13686518  0.035780232 0.2379501 0.0020913
## 4-3 0.04914997 -0.029956903 0.1282568 0.4363380
## 5-3 0.06853738 -0.028628169 0.1657029 0.3037834
## 5-4 0.01938741 -0.086074368 0.1248492 0.9871887

#summary stats by groups
tapply(clean_data$trust, clean_data$sex, summary)

## $`1`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.500   2.875   2.789   3.125   4.000 
## 
## $`2`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.250   2.625   3.000   2.890   3.125   4.000

tapply(clean_data$trust, clean_data$race, summary)

## $`1`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.500   2.875   2.826   3.125   4.000 
## 
## $`2`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.250   2.625   2.875   2.848   3.125   3.750 
## 
## $`3`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.625   2.750   3.000   3.016   3.250   3.750 
## 
## $`5`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.875   2.281   2.688   2.726   3.125   3.625 
## 
## $`6`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   3.062   3.125   3.125   3.188   3.250 
## 
## $`7`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.625   3.000   2.924   3.250   3.875

tapply(clean_data$trust, clean_data$ideo, summary)

## $`1`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.250   2.375   2.750   2.708   3.000   3.875 
## 
## $`2`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.250   2.500   2.750   2.780   3.125   4.000 
## 
## $`3`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.625   2.875   2.850   3.125   3.875 
## 
## $`4`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.250   2.625   3.000   2.911   3.250   3.875 
## 
## $`5`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.625   3.000   2.931   3.250   4.000

Regression

How do skill and openness to cross-cutting exposure predict trust in social media source, after controling for sex, race and ideologies?

Answering the question above requires using more than one independent variable and some of the independent variables are categorical and others are continuous.This is where regression comes into play. You use regression when your dependent variable is continuous and your independent variables are continuous or when your independent variables are mixed (i.e., some are continuous and others are categorical).

In this example, we use the same dataset, but will create two new continuous variables from the raw data. The two new variables are skill and openness to cross-cutting exposure.

#create variables

#remove missing values and inapplicable cases
clean_data <- clean_data[!is.na(clean_data$q5)&
                        !is.na(clean_data$q3a) &
                        !is.na(clean_data$q3b) &
                        !is.na(clean_data$q3c) &
                        !is.na(clean_data$q3d) &
                        !is.na(clean_data$q3e) &
                        !is.na(clean_data$q3f),]

clean_data<- clean_data[clean_data$q5<=3 &
                                clean_data$q3a <= 2 &
                                clean_data$q3b <= 2 &
                                clean_data$q3c <= 2 &
                                clean_data$q3d <= 2 &
                                clean_data$q3e <= 2 &
                                clean_data$q3f <= 2,]

library(car)
clean_data$q3b_reversed <- recode(clean_data$q3b, "1=2; 2=1")
clean_data$q3c_reversed <- recode(clean_data$q3c, "1=2; 2=1")
clean_data$q3d_reversed <- recode(clean_data$q3d, "1=2; 2=1")
clean_data$q3f_reversed <- recode(clean_data$q3f, "1=2; 2=1")

clean_data$skill <- clean_data$q5
clean_data$skill <- recode(clean_data$skill, "1=3; 2=2;3=1")

clean_data$openness <- clean_data$q3a + clean_data$q3b_reversed + clean_data$q3c_reversed + clean_data$q3d_reversed + clean_data$q3e + clean_data$q3f_reversed

Plot the key variables.

hist(clean_data$skill)

hist(clean_data$openness)

Perform regression

lm.1 <- lm(skill~ openness+ideo+sex+race, data = clean_data)
summary(lm.1)

## 
## Call:
## lm(formula = skill ~ openness + ideo + sex + race, data = clean_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8043 -0.5374  0.2979  0.4082  0.9228 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.862520   0.162010  11.496  < 2e-16 ***
## openness     0.077866   0.014905   5.224 1.95e-07 ***
## ideo2       -0.060737   0.053647  -1.132 0.257719    
## ideo3        0.007417   0.052175   0.142 0.886966    
## ideo4       -0.016933   0.055969  -0.303 0.762271    
## ideo5        0.050796   0.062358   0.815 0.415412    
## sex2        -0.066499   0.027570  -2.412 0.015965 *  
## race2       -0.165911   0.045488  -3.647 0.000272 ***
## race3       -0.063471   0.069635  -0.911 0.362169    
## race5        0.035560   0.121053   0.294 0.768976    
## race6        0.396826   0.339322   1.169 0.242369    
## race7       -0.358886   0.057703  -6.220 6.17e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5865 on 1821 degrees of freedom
## Multiple R-squared:   0.05,  Adjusted R-squared:  0.04426 
## F-statistic: 8.713 on 11 and 1821 DF,  p-value: 3.219e-15

Take a closer look at the regression model

library(lm.beta)
lm.beta(lm.1)

## 
## Call:
## lm(formula = skill ~ openness + ideo + sex + race, data = clean_data)
## 
## Standardized Coefficients::
##  (Intercept)     openness        ideo2        ideo3        ideo4 
##  0.000000000  0.120411725 -0.044373592  0.005867208 -0.011298187 
##        ideo5         sex2        race2        race3        race5 
##  0.026579526 -0.055425981 -0.084126065 -0.021097938  0.006740099 
##        race6        race7 
##  0.026746125 -0.143328105

#Text cleaning

Why text cleaning?

Textual data are always messy. The data may contain words that, if taken out of context, would be meaningless. You may also encounter a group of different words which convey the same meaning. Or you might have to convert slangs and acronyms into standard English, or emojis into something computer can recognize. Only by cleaning the mess and the noise in the text will you be able to discern useful patterns and signals.

From corpus to DFM There is a lot of interest in quantifying and visualizing textual data. Texts reveal our thoughts, our personality, and the pulse of a society. We broadly refer to the quantification of text as text mining. Thanks to the developments in Natural Language Processing and Information retrieval, we now have a wide selection of easy-to-use R libraries for cleaning, transforming, quantifying, and visualizing text.

Which R library?

Throughout the tutorials on text mining, we will use the library quantaeda, which stands for Quantitative Analysis of Textual Data. You can visit the library website to view many examples. There is another decent text mining library called tidytext. Although we will not use it since there is a considerable overlap in functionality between the two libraries, I still highly recommend checking out tidytext and a free e-book written by the library author.

What data? In this tutorial, we will experiment with a dataset containing 4,995 tweets containing the #metoo hashtag. You can download the data here. Or you can run the code below to download the data into your RStudio. For faster loading, we ask R to show the first 50 tweets and four selected columns. Text analytics typically begins with text files (e.g., a data frame or a CSV file) containing character strings (the actual text) and the metadata of the text. In the example below, we are interested in analyzing the text content in the text column.

library(readr)
library(DT)

metoo_tweets <- read.csv("https://curiositybits.cc/files/metoo.csv")

#show the first 500 arrests
datatable(metoo_tweets[1:50,c("status_id","screen_name","text","retweet_count")], options = list(pageLength = 5))

Workflow

A typical workflow in text analytics is:

create a corpus from text files;
tokenization;
create a DFM (document-feature matrix);
conduct a variety of text analyses.

Corpus

Think of a corpus as a container of different documents. A corpus is the complete collection of text you want to analyze, Documents are individual units in that collection. For example, if you want to analyze someone’s tweets (the corpus), each unique tweet can be treated as a document. Another example is: if you want to compare tweets sent by GOP and the Democratic Party, you can treat all tweets sent by one party as a single document. This treatment will lead you to have two documents in the corpus. Below we will create a corpus of 4,995 documents (i.e., tweets).

library(quanteda)

metoo_tweets$text <- as.character(metoo_tweets$text) #optional step: convert the text column to characters

tweet_corpus <- corpus(metoo_tweets,docid_field = "status_id",text_field = "text")

tweet_corpus

## Corpus consisting of 4,995 documents and 86 docvars.

Alternatively, you can combine tweets sent by the same users and create a corpus in which a document is an user’s tweets.You will see that the corpus created from the code below contains 4,090 documents (i.e., unique users). In order to create the corpus, we first create a new data frame (called tweets_byusers) by using aggregate() to merge tweets from the same user. Then, in corpus(), we specify docid_field = “screen_name”. This tells R to recognize the city column as the document IDs.

library(dplyr)
tweets_byusers <- aggregate(text~screen_name, data = metoo_tweets, paste0, collapse=". ")

#create a corpus based on the new data frame
user_corpus <- corpus(tweets_byusers,docid_field = "screen_name",text_field = "text")

user_corpus

## Corpus consisting of 4,090 documents and 0 docvars.

Tokenization

Run the code below and see how text is tokenized.

This tokenize the second tweet based on words.

tokens(tweet_corpus[2])

## tokens from 1 document.
## x1103090889126264833 :
##  [1] "Jessi"      "Hempel's"   "The"        "Problem"    "with"      
##  [6] "#METOO"     "and"        "Viral"      "Outrage"    "represent" 
## [11] "the"        "key"        "elements"   "of"         "a"         
## [16] "major"      "issue"      "in"         "society"    ";"         
## [21] "sexual"     "harassment" "and"        "women"      "are"       
## [26] "finally"    "speaking"   "out"        "about"      "their"     
## [31] "past"       "tragedies"  "."          "#COM416"    "https"     
## [36] ":"          "/"          "/"          "t.co"       "/"         
## [41] "15Uv2nGGGo"

tokens(), by default, segments a corpus into tokens where each token represents a word. The code above tokenizes the second document (i.e., the 2nd tweet) in the corpus.

We commonly tokenize corpus based on words. The following code tokenizes the corpus based on sentences so that each unique sentence is treated as a token. We do that by adding what = "sentence to the tokens() function.

tokens(tweet_corpus[2], what = "sentence")

## tokens from 1 document.
## x1103090889126264833 :
## [1] "Jessi Hempel's The Problem with #METOO and Viral Outrage represent the key elements of a major issue in society; sexual harassment and women are finally speaking out about their past tragedies."
## [2] "#COM416 https://t.co/15Uv2nGGGo"

DFM

With a corpus, we can go ahead and create a DFM (document feature matrix). What is a DFM? The best way to understand it is perhaps by playing around one. Run the code below. It will construct a DFM based on the corpus of 4,995 documents.

DFM is a matrix. Numbers in DFM represent how many times a token (commonly referring to a word or a sentence) appears in a given document. In the output above. The word confrontation appears once in document x1103092683717906432 (this refers to the status_id of a tweet).

Please note that the dfm() function performs tokenization under the hood. In other words, by default, dfm() tokenizes your text first before constructing a DFM object.

tweets_dfm <- dfm(tweet_corpus)

tweets_dfm[1:5,1:5] #show the first 5 words in the first 5 documents

## Document-feature matrix of: 5 documents, 5 features (64.0% sparse).
## 5 x 5 sparse Matrix of class "dfm"
##                       features
## docs                   #metoo confrontation can be alarming
##   x1103092683717906432      1             1   1  1        1
##   x1103090889126264833      1             0   0  0        0
##   x1103092651140808705      1             0   0  0        0
##   x1103092268490280961      1             0   0  0        0
##   x1103092648888410112      1             0   0  0        0

tweets_dfm[1:5,1:5] gives you the frequency distribution of the first 5 words across the first 5 documents. The complete matrix (the object named reviews_dfm) is 25,974 by 4,995. It means that this dfm object has 25,974 features (i.e., words) across 4,995 documents (tweets).

After you have a DFM from your text data, you can do all sorts of things. For instance, run the code below will show you the most frequent terms in your corpus.

library(ggplot2)

tweets_dfm %>% 
  textstat_frequency(n = 25) %>% 
  ggplot(aes(x = reorder(feature, frequency), y = frequency)) +
  geom_point() +
  coord_flip() +
  labs(x = NULL, y = "Frequency") +
  theme_minimal()

Filter unwanted words

But, wait a minute! The most frequent terms are: /, ., :? Shouldn’t we take those meaningless and noisy words and punctuation marks out of the analysis? You are right! What I have demonstrated above is a very messy and rudimentary step in text mining. Next we will cover a set of standard text cleaning and transformation procedures.

In the code below, a number of cleaning criteria are added. Observe changes in the output.

tokens(tweet_corpus[2],remove_punct = TRUE,remove_numbers = TRUE, remove_symbols = TRUE, remove_twitter=TRUE, remove_url=TRUE)

## tokens from 1 document.
## x1103090889126264833 :
##  [1] "Jessi"      "Hempel's"   "The"        "Problem"    "with"      
##  [6] "METOO"      "and"        "Viral"      "Outrage"    "represent" 
## [11] "the"        "key"        "elements"   "of"         "a"         
## [16] "major"      "issue"      "in"         "society"    "sexual"    
## [21] "harassment" "and"        "women"      "are"        "finally"   
## [26] "speaking"   "out"        "about"      "their"      "past"      
## [31] "tragedies"

This time, tokens() segments our corpus into word tokens, but in the process, remove punctuation marks, numbers, symbols, Twitter handles and hashtags, and URLs. As discussed early, choosing what cleaning criteria to apply should be decided on a case by case basis. For some tasks, you may want to preserve Twitter handles and hashtags because they hold the key to understanding your data. You may also want to preserve punctuations if you are interested in how punctuations convey emotions.

Now, let’s apply tokens() to tokenize the whole corpus and identify the most frequent terms. The tokenization produces an object call tok. We can create a DFM from review_tok.

tok <- tokens(tweet_corpus,remove_punct = TRUE,remove_numbers = TRUE, remove_symbols = TRUE, remove_twitter=TRUE, remove_url=TRUE)

tweets_dfm <- dfm(tok)

#find 20 most frequent terms
topfeatures(tweets_dfm, 20)

##    metoo      the       to        a      and       of       is       in 
##     4970     3337     2117     1772     1649     1627     1250     1177 
##        i      for      you     that       on     this       it movement 
##     1071      948      875      701      698      680      617      589 
##      are     with    women      amp 
##      557      519      486      447

Hold on, the most frequent terms are: the, to, a, and, of, is? This tells us hardly anything interesting about the data. Well, this is because there are common stop words in our data. Stop words are commonly used function words that we ask algorithms to ignore in processing text data. In many text analytics tasks, we reckon such stop words as meaningless noises. But, there is a school of thoughts that argue how people use function words (e.g., the, is, at, which, and on) reveals important insights about social identities, power dynamics in a communication context, and personality. Read more here.

To remove stop words from tok (the tokenized corpus created from the previous step), use the code below.

tok <- tokens_select(tok, pattern = stopwords('en'), selection = 'remove')

Alternatively, you can directly use remove = stopwords(“english”) inside the dfm() function. It will do the same trick.

tweets_dfm <- dfm(tweet_corpus, remove = stopwords("english"), stem = FALSE, remove_punct = TRUE)

Run the below code blocks. Each will test a new set of cleaning criteria.

tok <- tokens_select(tok, min_nchar=3, selection = 'keep') #remove any word shorter than3 characters.

tok <- tokens_select(tok, pattern = c("metoo","twitter","however"), selection = 'remove') #remove additional filter words.

The code above can be handy when we want to create our own list of filter words. Note that the tokenization process will automatically convert anything into lower cases, so instead of adding “However” as a filter word, we added “however.”

Ever wonder what stem = FALSE means in the code above? Or do you want to try other cleaning criteria? Stemming refers to a cleaning process that chops off the ends of words. Why do we do that? This is because different words (e.g., working, works, and worked) may share a same stem/root word and thus convey similar meaning. In text mining, we might want to treat such words as identical.

With the cleaned DFM, you can visualize 25 most frequent terms.

tweet_corpus <- corpus(metoo_tweets,docid_field = "status_id",text_field = "text")

tok <- tokens(tweet_corpus,remove_punct = TRUE,remove_numbers = TRUE, remove_symbols = TRUE, remove_twitter=TRUE, remove_url=TRUE)

tweets_dfm <- dfm(tok,remove = stopwords("english"))

tweets_dfm %>% 
  textstat_frequency(n = 25) %>% 
  ggplot(aes(x = reorder(feature, frequency), y = frequency)) +
  geom_point() +
  coord_flip() +
  labs(x = NULL, y = "Frequency") +
  theme_minimal()

#Keywords

Now you are on course to try basic text mining techniques to extract insights from textual data. In this tutorial, we will try four techniques: word cloud, n-grams, and keyness.

Word cloud

We can create a word cloud using the most frequent terms in the tweets.

textplot_wordcloud(tweets_dfm, max_words = 100)

Keyness

Just because a term is frequent, does not mean it is a “keyword.” In text mining, keywords are measured by “keyness”. It shows how distinctive a word is for a group of texts when compared to the texts in other groups. For instance, we suspect that verified Twitter users and unverified users tweet about #metoo differently. We can use the textstat_keyness function to find terms that distinguish verified users’ tweets from the tweets from unverified users.

tweets_dfm <- dfm(tok, groups="verified") 

result_keyness <- textstat_keyness(tweets_dfm, target = "TRUE") 

# Plot estimated word keyness
textplot_keyness(result_keyness)

N-gram

Text mining is a process of data reduction. It simplifies a complex text into tokens. Along the way, lots of information is lost, especially, when you take single words out of context. Next, we will try N-grams. N-grams can preserve multi-word phrases and expressions.

Below we construct bigrams and compare frequent bigrams between two groups of tweets: those sent by verified users and the ones sent by unverified users.

tweet_corpus <- corpus(metoo_tweets,docid_field = "status_id",text_field = "text")

tok <- tokens(tweet_corpus,remove_punct = TRUE,remove_numbers = TRUE, remove_symbols = TRUE, remove_twitter=TRUE, remove_url=TRUE)

tok <- tokens_ngrams(tok, n = 2)
tweets_dfm <- dfm(tok, groups="verified") 

set.seed(132)
textplot_wordcloud(tweets_dfm, comparison = TRUE, max_words = 100)

#Dictionary-based text analysis: sentiment detection

Researchers commonly use pre-defined and pre-tested dictionaries to identify themes and features in texts. The dictionary-based approach has been widely used in sentiment detection.During the 2012 US presidential election, Twitter, in partnership with several polling agencies, launched something called Twitter Political Index. The idea was to track candidates’ popularity among voters based on sentiment expressed in tweets. Back then, such idea was a novelty. Nowadays, sentiment analysis of social media text has been widely applied in marketing/PR, electoral forecasting, and sports analytics. The NPR show Planet Money even built a Twitter bot to automatically trade stocks based on sentiments in Trump’s tweets.

Below we use a library called syuzhet to identify sentiments expressed in tweets. The library uses the NRC Emotion Lexicon which rates words by eight dimensions: joy, anger, anticipation, disgust, fear, sadness, surprise and trust.

library(syuzhet) #this is the library for sentiment detection

#extract sentiments
Sentiment <- get_nrc_sentiment(metoo_tweets$text)

alltweets_senti <- cbind(metoo_tweets, Sentiment)

Next, we sum up across sentiment types and plot the data.

library(ggplot2)
library(ggthemes)
library(reshape2)
library(dplyr)

senti_aggregated <- Sentiment %>% 
  summarise(anger = sum(anger), 
            anticipation = sum(anticipation), 
            disgust = sum(disgust), 
            fear = sum(fear), 
            joy = sum(joy), 
            sadness = sum(sadness), 
            surprise = sum(surprise), 
            trust = sum(trust)) %>% melt

ggplot(data = senti_aggregated, aes(x = variable, y = value)) +
  geom_bar(aes(fill = value), stat = "identity") + geom_smooth(method = "lm", se = FALSE) + scale_color_fivethirtyeight("cyl") +
  theme_fivethirtyeight()

#A network approach towards text: semantic networks

To understand what a semantic network looks like, go ahead and run the code below.

library(quanteda)
library(ggplot2)

tweet_corpus <- corpus(metoo_tweets,docid_field = "status_id",text_field = "text")

tok <- tokens(tweet_corpus,remove_punct = TRUE,remove_numbers = TRUE, remove_symbols = TRUE, remove_twitter=TRUE, remove_url=TRUE)

tweets_dfm <- dfm(tok, remove = stopwords("english")) 

#create a feature co-occurrence matrix (FCM)
tweets_fcm <- fcm(tweets_dfm) 

#extract the top 50 frequent terms from the FCM object
feat <- names(topfeatures(tweets_fcm, 100)) 

#trim the old FCM object into a one that contains only the 50 frequent terms 
fcm_select <- fcm_select(tweets_fcm, pattern = feat)

set.seed(144)
textplot_network(fcm_select, min_freq = 0.8)

What you see in the output are a bunch of words interconnected to each other. A semantic network can tell us the most central concept/ideas in your corpus. We can use mathematical functions to quantify each position in a semantic network. For example, centrally located words tend to have higher betweenness centrality. Even without any mathematic calculation, we can easily spot the most central words based on the visualization. The concept of semantic networks is developed from the graph theory and social network analysis. We will have a separate set of tutorials on online social networks.

How are words connected to one and another? You may wonder about that. In quantaeda, a semantic network is referred to as feature co-occurrence matrix (FCM). It is a type of network based on co-occurrence: more specifically, two words are linked to each other if they appear in the same document. In our case, two words are connected to each other if they occur in the same review.

The semantic network approach can be useful in mapping out central ideas and see how different ideas are connected and clustered. A shortcoming here, though, is that there are too many vague words in the corpus. A refined approach is tagging each word token as noun, adjective, or verb (a process called Part-of-speech tagging) and then create a semantic network based only on one type of words, say, a semantic network entirely based on nouns or adjectives. You may also consider this approach in analyzing Twitter hashtags. You will learn alot by looking at how different hashtags are concentrated and clustered based on their co-occurrence in the same tweet. See the code below.

tweet_corpus <- corpus(metoo_tweets,docid_field = "status_id",text_field = "text")

tweets_dfm <- dfm(tweet_corpus) 

tweets_dfm <- dfm_select(tweets_dfm, pattern = ("#*"))

tweets_fcm <- fcm(tweets_dfm) 

#extract the top 50 frequent terms from the FCM object
feat <- names(topfeatures(tweets_fcm, 100)) 

#trim the old FCM object into a one that contains only the 50 frequent terms 
fcm_select <- fcm_select(tweets_fcm, pattern = feat)

set.seed(144)
textplot_network(fcm_select, min_freq = 0.8)

#Topic models

What is a topic model?

Have you dreamed of a day when algorithms can quickly scan through your textbooks and give you a bullet point summary? How convenient! No more tedious reading! Actually, there are algorithms out there that do automatic summarization of large-scale corpus. They are called topic models. In building topic models, we basically ask computers to discover some abstract topics from the text. The internal logic is this: words about the same topic tend to be used together in the same or adjacent semantic space.

New library?

The topic model feature is built in the quantaeda library. But you may need to install_topicmodels_.

The workflow

Run the code below and wait untill you get an output.

As you can tell, producing topic models takes time–it is computationally intensive if you hope to build a topic model based on a large-scale corpus.

In quanteda, we convert a regular DFM object into a format ready for building topic models (see the convert() function in the code above). We then apply the LDA() function to build a topic mode. LDA stands for Latent Dirichlet allocation. It is one of the most commonly used topic modeling algorithms. terms(lda, 10) gives 10 frequent terms from each topic.

But, what does k=10 mean?

By setting k=10, we ask the LDA algorithm to identify ten topics in the corpus. This is where things get extremely complicated and confusing. The algorithm is agnostic about how many topics are there in the corpus. So you will start with a k value and check if the keywords returned shows any clear distinction of topics. If you struggle to summarize the text based on the keywords returned, chances are it is not a good model. Then you will go back and adjust the k value until you find one that gives you somewhat sensible output.

Tuning up a topic model is an art. It is like tuning up a telescope. So based on the corpus we have, let’s try different k values. This time, run the code in your own laptop.

If you want to plot key words from each topic model, try the following code in the library tidytext.

library(tidytext)
library(topicmodels)

dfm_fortm <- convert(tweets_dfm, to = "topicmodels")
lda <- LDA(dfm_fortm, k = 10)

tuber_topics <- tidy(lda, matrix = "beta")

top_terms <- tuber_topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()

Extracting meaning?

Based on your experience with topic models so far, do you think computers can extract true meanings from textual data?

#What can a network tell us?

Have you wondered how information spreads on Twitter, how Instagram influencers are identified, and how different actors in an online community collaborate or confront one and another? There are the sorts of questions that can be best answered using network analysis and network visualization. In network analysis of internet communities, we visualize and quantify the structure of social relationships and information flows. See a real-world application my team has built to track the upcoming Philippine General Election.

Here, you can see a retweet network based on 9,999 #BreakUpBigTech tweets. In this network, a pair of users represents a retweeting relationship. That is, two users are connected to one and another if one retweets or is retweeted by the other. For simplicity, the graph below only shows users who at least twice retweeted or were retweeted by others.

Guess how the size and color of a node is determined.

#Edges

Where do we begin to visualize a network? It all starts with nodes and edges. The table below shows 20 tweets.

An edgelist shows all edges in a network along with attributes of the edges. An edge is a pair of relationship between two nodes (in this case, users). An edge can be directed: for example, A retweets B will be expressed as User A → User B, whereas B retweets A is expressed as User B → User A. But, in some cases, an edge is undirected. Think about your Facebook relationships. If user A is a friend of user B. By default, user B is also connected to user A.

An edgelist based on the 20 tweets looks like this. The column source lists the Twitter users who retweeted. The column target shows those users who were on the receiving end of the retweets (i.e., users who were retweeted by others). The size column is edge weight, referring to the number of retweeting that occurred between the same pair of users.

#Nodes

In our example, a node is a Twitter user. Below is a list of nodes, with their id, labels, and attributes (e.g., size). Wonder how size is determined? We will cover this in the later part of the tutorial.

#How to turn tweets into network?

Here comes the real deal: how to turn collected tweets into a network. Previously, the process involves several steps of text cleaning to extract relevant @screennames. A new library called graphtweets[http://graphtweets.john-coene.com/index.html] makes the task much easier. Graphtweets works seamlessly on Twitter data collected through rtweet[https://rtweet.info/]. You can easily tweak the code in graphtweets to make it work for data that come from different shapes and sizes. Below, I will use the data frame tweets as a demo. tweets contains 9,999 tweets that use #BreakUpBigTech.

Make sure graphtweets is installed and loaded. We begin by defining a R function. A function is a set of codes organized together to perform a specific task. R has a large number of built-in functions. We can also create our own functions. A self-defined function will save a lot of repetitive work. See how I define a function called extractrt below.

library(graphTweets)

extractrt <- function(df){
  rt <- df %>% 
    gt_edges(screen_name, retweet_screen_name) %>% # get edges
    gt_nodes() %>% # get nodes
    gt_collect() # collect
  
  return(rt)
}

The self-defined function extractrt takes in df (it will have to be a data frame from rtweet in order for the function to work). The function then uses three in-built functions from graphtweets to extract nodes and edges from tweets in df. In a standard data frame returned from rtweet, the sender of a retweet (the user who retweets other) is in the screen_name column, and the retweeted users are in the retweet_screen_name column. See how this particular process is spelled out below.

 gt_edges(screen_name, retweet_screen_name) %>% # get edges

The function extractrt creates and returns an object called rt. rt can be easily converted to an igraph object. igraph is one of the most common network analysis libraries in R. We will deal with igraph later.

After a function is defined in R, we can apply it to a data frame of tweets. Below, we apply extractrt to tweets and create rtnet.

rtnet <- extractrt(tweets)

The reason we call this function extractrt is that it only extracts retweeting relationships. How about Twitter mentions/replies. A Twitter mention/reply signifies a more engaged mode of interaction. A retweet is mostly a passive information relay, but a Twitter mention/reply is an active outreach. In the code below, I define extractmt as a function for extracting edges and nodes in Twitter mentions/replies. This function scans tweets and extracts @screennames in the screen_name and mentions_screen_name column. I then apply extractmt to tweets and create mtnet.

extractmt <- function(df){
  
  mt <- df %>% 
    gt_edges(screen_name, mentions_screen_name) %>% # get edges
    gt_nodes() %>% # get nodes
    gt_collect() # collect
  
  return(mt)
}

mtnet <- extractmt(tweets)

Now that we have two network objects: rtnet and mtnet. We want to take a look at them. But unlike data frames, you cannot just click to view a network object. You can use the two following self-defined functions to get node lists and edgelists from the two network objects.

#define a function called nodes to extract node information from a network object

nodes <- function(net){
  
  c(edges, nodes) %<-% net
  nodes$id <- as.factor(nodes$nodes) 
  nodes$size <- nodes$n 
  nodes <- nodes2sg(nodes)
  nodes <- nodes[,2:5]
  
  return(nodes)
}

#define a function called edges to extract edge information from a network object

edges <- function(net){
  
  c(edges, nodes) %<-% net
  edges$id <- seq(1, nrow(edges))
  edges <- edges2sg(edges)
  
  return(edges)
}

#apply the two self-defined functions
rtnet_nodes <- nodes(rtnet)
rtnet_edges <- edges(rtnet)

mtnet_nodes <- nodes(mtnet)
mtnet_edges <- edges(mtnet)

From the above step, we create four objects: rtnet_nodes, rtnet_edges, mtnet_nodes, mtnet_edges. The four objects are all data frames. Let’s take a look at rtnet_edges. It is an edgelist.

library(DT)
datatable(rtnet_edges, options = list(pageLength = 5))

#Convert to igraph

Network analysis is essentially a mathematical process. Any user and any network can be scored based on some attributes. To do this, we will convert our network objects into igraph objects. For example, for the retweet network, we can create an igraph object based on rtnet_edges and rtnet_nodes. See the code and comments below.

# use rtnet_edges as the edgelist and rtnet_nodes as the node list. Set the network type as directed

rt <- graph_from_data_frame(d=rtnet_edges, vertices=rtnet_nodes, directed=T) 

# see edge weight by copying the values from the size column in rtnet_edges

rt <- set_edge_attr(rt, "weight", value= rtnet_edges$size)

# we do the same for the mention network

mt <- graph_from_data_frame(d=mtnet_edges, vertices=mtnet_nodes, directed=T) 
mt <- set_edge_attr(mt, "weight", value= mtnet_edges$size)

But first, let’s just take a look at some network-level indicators.

#Size matters! A quick way to compare different networks (e.g., the retweet network vs. mention network) is looking at its size. Run the code below to get a count of edges and nodes in rtnet and mtnet.

Which network has more users in it? And which network has more connections?

vcount(rt) #this shows the number of nodes/vertices in rt

## [1] 9092

vcount(mt) #this shows the number of nodes/vertices in mt

## [1] 9322

ecount(rt) #this shows the number of edges in rt

## [1] 9137

ecount(mt) #this shows the number of edges in mt

## [1] 10399

#Dense or sparse?

A densely connected network (high density score) is a type of network in which many users are interconnected, whereas a sparse network (low density) is a network in which only a few are interconnected. Two contrasting examples of dense and sparse networks are a network of people in a family gathering in which almost everyone knows everyone else, and a network of people sitting on a public bus.

Which network is more interconnected?

edge_density(rt, loops = FALSE) #the density of rt

## [1] 0.0001105433

edge_density(mt, loops = FALSE) #the density of mt

## [1] 0.0001196796

#Centralized or decentralized?

Think of centralization as a question of inequality and who is in control. In a centralized network, a small number of nodes (users) control the information flow. In a retweet network specifically, it means that only a handful of users retweet or are retweeted by others. Centralized and decentralized networks have different ramification for the diffusion of ideas, norms, and effective mobilization.

by setting mode = c(“in”), we calculate the centralization score based on the extent to which users are retweeted by others (as opposed to retweet others).

So, which network is more centralized?

#Calculate centralization
centr_degree(rt, mode = c("in"), loops = TRUE,normalized = TRUE)$centralization

## [1] 0.95006

centr_degree(mt, mode = c("in"), loops = TRUE,normalized = TRUE)$centralization

## [1] 0.936797

#Birds of a feathre flock together?

Have you heard of the saying birds of a feather flock together? In a network, nodes tend to cluster together based on some shared attributes. For instance, Twitter users may retweet mostly content they agree with. Hence, this tendency will result in a cluster of nodes based on similar mindsets or opinions. To what extent is a network reflecting this pattern of clustering can be quantified by using clustering coefficient.

transitivity(rt)

## [1] 1.929652e-06

transitivity(mt)

## [1] 0.0001250248

#Is it reciprocal?

Reciprocity is calculated as the proportion of reciprocated ties. In the retweet network, for example, reciprocity shows the extent to which a pair of users have mutually retweeted one and another.

Which form of Twitter interactions (retweet vs. mention) is more reciprocal?

reciprocity(rt)

## [1] 0.000219082

reciprocity(mt)

## [1] 0.001733436

#Look for influencers

I have introduced previously a range of indicators to quantify a network. Such indicators are only useful when it involves a comparison of different networks. When analyzing one single network, we are more interested in node-level indicators.

A common task in network analysis is identifying influencers? An influencer could mean different things to different people. Here we try a couple of dfferent metrics.

indegree centrality measures the number of incoming connections a user has received. A high indegree in the retweet network means that the user is frequently retweeted by others. Do you agree that the most retweeted users are influencers? And why?

indegree_rt <- sort(degree(rt,mode = "in"),decreasing = TRUE)
indegree_rt[1:10] #show the top 10 users ranked by in-degree

##         ewarren     anandwrites      omanreagan   stclairashley 
##            8638             107              40              40 
##        guardian      chadfelixg          a35362         soyrosa 
##              39              20              19              16 
##         jc_cali myth_capitalism 
##              15              14

outdegree centrality measures the number of outgoing connections a user has. A high outdegree in the retweet network means that the user frequently retweets other users. What would you call such users, mobilizers?

outdegree_rt <- sort(degree(rt,mode = "out"),decreasing = TRUE)
outdegree_rt[1:10] #show the top 10 users ranked by out-degree

## edwood05572006         raqb16   damonbethea1    fuelgrannie   gavin_bonnar 
##             13              6              5              5              4 
##   atheist_cvnt     natemezmer  philippejouan  sharonresists   tedgrunewald 
##              3              3              3              3              3

Betweenness centrality measures the number of times a node lies on the shortest path between other nodes. We use this metric to find users who act as ‘bridges’ between nodes in a network and who influence the information flow around a network.

bt <- sort(betweenness(rt, directed=T, weights=NA), decreasing = TRUE)
bt[1:10] #show the top 10 nodes by betweenness centrality

##      omanreagan      chadfelixg         soyrosa    damonbethea1 
##              32              18              16              13 
##         jc_cali    commondreams  resistasista76          yitzee 
##               8               4               3               3 
##      tuxcedocat vivek_gkrishnan 
##               2               2

Ever wonder how Google ranks search results? It uses the PageRank algorithm developed by Google’s founders Sergey Brin and Larry Page. We can use PageRank to locate influencers as well.

pr <- page_rank(rt, algo = c("prpack"))
pr <- sort(pr$vector,decreasing = TRUE)
pr[1:10] #show the top 10 users ranked by PageRank

##         ewarren     anandwrites myth_capitalism      omanreagan 
##    0.4338045988    0.0047293401    0.0042530946    0.0042105884 
##   scottrickhoff   stclairashley        guardian      chadfelixg 
##    0.0020597744    0.0017596358    0.0016080549    0.0010092895 
##          a35362         soyrosa 
##    0.0009209153    0.0008592202

#Look for clusters/cliques

We use community detection algorithm to cluster users into different groups (we call such groups clusters or cliques). Users in the same cluster are more connected with one and another than with users outside of the cluster. By using the community detection method, we can reveal important divisions and fragmentation that exist due to different opinions, values, and user characteristics.

Some community detection algorithms require intensive computating. It may take a long time to produce an output.

k-core

Creating k-core is fast and easy. We can use k-core to identify a small subset of users who are the most interconnected. In a k-core, each node has at least k connections with everyone else. Below we extract a 2-core (named twocore) in which each user has at least 2 edges with any other users in the core.

kcore <- coreness(rt, mode="all") 
twocore <- induced_subgraph(rt, kcore>=2)

edge betweenness (Newman-Girvan)

This is one of the community detection algorithm that is computationally intensive. Be patient when it is crunching numbers for you.

The code above creates an object call ceb. It contains the information about which cluster each node belongs to. We can run the code below to see the cluster ID of the first 10 nodes.

ceb <- cluster_edge_betweenness(rt) 

print("there are",length(ceb),"clusters based on this community detection algorithm")

membership(ceb)[1:10] #list only 10 nodes.

#Visualize a network and make it pretty!

There are many ways to visualize a network. You can make the visualization static or interactive as this example shows. You can even create a dynamic one showing the evolution of a network over time an example.

Below, we will try some of the basics using two libraries igraph and VisNetwork. igraph comes with some in-built functions for visualization. VisNetwork takes a step further by making it prettier and interactive.

Before you visualize a network, here are the decisions you need to make:

do you want to assign colors to nodes based on some node attributes?
do you want to set the node size based on some attributes?
do you want to show all nodes?

In our example below, we color nodes based on the clusters they belong to. We set the node size based on PageRank score (the famous scoring technique used by Google), with central nodes represented by bigger nodes. And we don’t want to show all nodes as that will create a messy network; Instead, we would show only the most interconnected subset (using k-core).

In the previous steps, we know the codes for calculating node and network-level metrics (e.g., centrality). Here, we will pass the metrics to nodes and store them as node attributes. This will allows the visualization code to pick up the attributes and use them for sizing and coloring.

In the code below, we add PageRank score (used for node size) and the cluster id (used for assigning color). We use V(rt) to access node attributes and E(rt) to access edge attributes.

library(igraph)
library(visNetwork)
library(scales)

pr <-page_rank(rt, algo = c("prpack"))
V(rt)$size <- pr$vector*100  #set node size by PageRank scores.

wc <- cluster_walktrap(rt)

V(rt)$color <- membership(wc) # set color by subgroup id

Since we visualize only the 2-core. We create a subset of the network.

kcore <- coreness(rt, mode="all") 
twocore <- induced_subgraph(rt, kcore>=2)

Find a visualization algorithm that fits

And we visualize it. Notice that we set layout = “layout_nicely”? This is how we specify which visualization algorithm to use. There is a whole bunch of them: see the listing. If you are curious about visual effects from different algorithms, try layout =“layout_in_circle” or layout =“layout_with_kk” or layout =“layout_with_sugiyama”

visIgraph(twocore,idToLabel = TRUE,layout = "layout_nicely") %>%
  visOptions(highlightNearest = TRUE, nodesIdSelection = TRUE)

A work by Weiai Wayne Xu

curiositybits.cc

COMM621 Quantitative Research Methods

Weiai Wayne Xu
curiositybits.cc
Department of Communication, UMass-Amherst

2019-04-21

Regression

COMM621 Quantitative Research Methods

Weiai Wayne Xu curiositybits.cc Department of Communication, UMass-Amherst

2019-04-21

Regression

Weiai Wayne Xu
curiositybits.cc
Department of Communication, UMass-Amherst