This is a machine-assisted qualitative content analysis of class notes prepared by seven participants of a reading group for a Winter 2021 PhD Seminar at McGill University. Every week we each choose among the assigned readings and prepare summary notes for the rest of the group, to aid their reading. The texts of these notes are the raw data for this analysis.
Our stated goals are to use this analysis to explore content analysis methods and to look for semantic structure in our note-taking. This R vignette will capture the entire analysis pipeline. As with any halfway-decent machine-assisted qualitative analysis, it helps to have actually read the original texts.
For the lay reader: the nature of our PhD seminar, the intellectual proclivities of our professor, and our own linguistic tendencies will be made manifest in our analysis of the text.
#preamble
rm(list=ls()) #clear memory
setwd("~/Documents/OneDrive - McGill University/R/Projects/notes710") #working directory
#required packages
library(knitr) #publishing this notebook
library(gsheet) #accessing raw data from google sheets
library(tidyverse) #data manipulation
library(data.table) #data manipulation
library(tidytext) #text manipulation
library(textmineR) #topic modeling
library(philentropy) #clustering topics
library(ggplot2) #plot
library(RColorBrewer) #plot
The first step in this process is cleaning and arranging the raw data, the texts. I have pasted the text from the notes into a google sheet (sans formatting), one row per document. Conceptually, each reading summary written by each note-taker constitutes one “document” in our corpus. There is information in formatting. For example, whether an author prefers to use bullets or prose, or how they organize text into paragraphs tells us something about them. In this analysis, however, we do not consider formatting.
An inspection of the data reveals that some of the readings are related, so we may expect notes about related readings to be more similar to each other than to other notes in some ways. We may also expect notes written by the same noter to be more similar in other ways to each other than notes written by different noters.
Now, import it all into R.
data1<-gsheet2tbl('docs.google.com/spreadsheets/d/1c2eWivCgCpU74-U0paQXfWAWNKo_O1euxua7JlXbJWo') #use the sharing/viewing url for the sheet here
print(data1,n=7) #print first 7 rows
## # A tibble: 48 x 7
## noteid noter author year title selection notes_noformat
## <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 04YJL a1 Sartori, G. 1970 "Concept misf… 1033-1053 "Introduction A cons…
## 2 04AGR a2 Goertz, G.… 2012 "Concepts, De… Ch10 "Overview The chapte…
## 3 04AB a3 Prasad, V.… 2015 "Surrogate Ou… Ch3 "Context-Free Takeaw…
## 4 04MK a4 Shadish, W… 2002 "Construct Va… pp. 64-82… "Relationship: The c…
## 5 04JS a5 Shadish, W… 2002 "Construct Va… second 9 "Threats to Construc…
## 6 04JN a6 Grigoryeva… 2015 "The historic… All "In this article, th…
## 7 04AMB a7 Collier, D… 1993 "Conceptual\"… All "This discussion pap…
## # … with 41 more rows
Next, we tokenize this text. This means we’re going to transform the data into one row per word. This helps us do some basic counting and cleaning. We’ll reconstitute it back into one row per document before modeling it. Note that you can choose to tokenize by bigram (two words per row), trigram (three words per row), line, sentence, paragraph, etc., depending on your unit of analysis. Because we disaggregate texts like this, it is critical that you have a unique identifier for each document. I like my unique identifiers to also be informative. Here, noteid is informative of the “week” of class and the author of the note.
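For instance, a minimal sketch of tokenizing by bigram instead of by single word (not part of this pipeline; notes_bigrams is just an illustrative name):
notes_bigrams<-data1 %>%
unnest_tokens(bigram,notes_noformat,token="ngrams",n=2) #one two-word combination per row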
This “only one thing per row” concept is core to the tidyverse. We’ll be using a lot of tidy methods here. Tidy is a disciplined way of thinking about data architecture that facilitates a relatively simple but versatile programming syntax, which lets us ask a huge number of empirical research questions. We will, of course, frequently make the data unTidy for various practical reasons.
This is what the data look like now. All punctuation and capitalization have been stripped out.
notes<-data1[,c(1,2,7)] #keep only noteid, noter, and notes_noformat
notes_unnested<-notes %>%
unnest_tokens(word,notes_noformat) #unnest column notes_noformat into words.
print(notes_unnested,n=15)
## # A tibble: 52,587 x 3
## noteid noter word
## <chr> <chr> <chr>
## 1 04YJL a1 introduction
## 2 04YJL a1 a
## 3 04YJL a1 conscious
## 4 04YJL a1 thinker
## 5 04YJL a1 is
## 6 04YJL a1 a
## 7 04YJL a1 person
## 8 04YJL a1 who
## 9 04YJL a1 is
## 10 04YJL a1 aware
## 11 04YJL a1 of
## 12 04YJL a1 the
## 13 04YJL a1 assumptions
## 14 04YJL a1 and
## 15 04YJL a1 implications
## # … with 52,572 more rows
Note the information around the data. This is a “tibble” with a certain number of rows x columns. Since this is a one-word-per-row table, the 52,587 rows of notes_unnested mean there are 52,587 words in our corpus.
For the number of unique words in the corpus, look at the row count of the notes_words table below: 5,938.
notes_words<-notes_unnested %>% # in the notes_unnested table,
count(word,sort=TRUE) # count how many times each word appears
print(notes_words,n=15)
## # A tibble: 5,938 x 2
## word n
## <chr> <int>
## 1 the 3212
## 2 of 2048
## 3 to 1441
## 4 and 1180
## 5 a 1015
## 6 in 990
## 7 is 743
## 8 that 591
## 9 for 487
## 10 are 418
## 11 on 391
## 12 be 359
## 13 with 347
## 14 not 310
## 15 treatment 304
## # … with 5,923 more rows
Now we can start looking at things like most frequent words by noter.
n<-10
noter_topw<-notes_unnested %>% #in the notes_unnested table,
count(noter,word,sort=TRUE) %>% #count unique noter-word combinations,
group_by(noter) %>%
top_n(n) %>% #and show me the top n words used by each noter
arrange(noter)
#make the table "wide" so it's easier to read
noter_topw_w<-data.frame(matrix(0,nrow=n,ncol=length(unique(noter_topw$noter))))
for (i in 1:length(unique(noter_topw$noter))) {
noter_topw_w[,i]<-head(noter_topw[noter_topw$noter==unique(noter_topw$noter)[i],2],n) #head() trims any ties that top_n() kept beyond n
}
names(noter_topw_w)<-unique(noter_topw$noter)
head(noter_topw_w,n=n)
## # A tibble: 10 x 7
## a1 a2 a3 a4 a5 a6 a7
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 the of the the the the the
## 2 of the of of of of of
## 3 to and a to to to to
## 4 and to to and and a a
## 5 a a is in in is and
## 6 in in and a a in in
## 7 is is in for that and is
## 8 are that that is is that that
## 9 be with for that for case are
## 10 causal on or on be study p
You’ll notice a lot of words like “the” and “of”. Let’s remove them. We’ll use a pre-existing lexicon of “stop words” to remove them from our corpus, and redo the top word exercise. Some of our semantic “personality” is likely reflected in how we use these words, so we lose some of that information when we remove stop words.
data(stop_words) #load lexicon
notes_unnested_ns <- notes_unnested %>%
anti_join(stop_words,by="word") #drop every row whose word appears in the stop word lexicon
n<-10
noter_topw_ns<-notes_unnested_ns %>%
count(noter,word,sort=TRUE) %>%
group_by(noter) %>%
top_n(n) %>%
arrange(noter)
noter_topw_w_ns<-data.frame(matrix(0,nrow=n,ncol=length(unique(noter_topw_ns$noter))))
for (i in 1:length(unique(noter_topw_ns$noter))) {
noter_topw_w_ns[,i]<-head(noter_topw_ns[noter_topw_ns$noter==unique(noter_topw_ns$noter)[i],2],n)
}
names(noter_topw_w_ns)<-unique(noter_topw_ns$noter)
head(noter_topw_w_ns,n=n)
## # A tibble: 10 x 7
## a1 a2 a3 a4 a5 a6 a7
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 causal status bias treatme… treatme… study experime…
## 2 treatment concept causal causal referred treatment field
## 3 selection unionization research control workers bribery category
## 4 slaves research outcome analysis 2 research categori…
## 5 variable data variable albums job unit subjects
## 6 theory effects effect effect analysis causal populati…
## 7 research qual treatment assigned causal baseline effect
## 8 slave ietf selection study control intervent… experime…
## 9 observatio… organization… observation… data model researche… e.g
## 10 variables quality outcomes outcome sample i.e 1
Note: cleaning your analysis corpus “automagically” has some limitations. Words like “cases” and “different” get cleaned out. I’m not sure I agree that these are meaningless “filler” words. But the alternative is to manually remove stop words one by one, and that can get very tedious. So “anti-joining” is an easy option. Anti-joining A and B means removing all elements of \(A \cap B\) from A.
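To make that concrete, a toy sketch of an anti-join (hypothetical tables A and B, just to show the mechanics):
A<-tibble(word=c("cases","different","causal"))
B<-tibble(word=c("cases","different"))
anti_join(A,B,by="word") #returns only "causal": rows of A whose word also appears in B are dropped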
Let’s also check what the stop word removal did to the size of our corpus.
#wordcount
dim(notes_unnested_ns)[1] #number of rows. For columns, [2]
## [1] 26149
#unique words
notes_words_ns<-notes_unnested_ns %>%
count(word,sort=TRUE)
dim(notes_words_ns)[1]
## [1] 5413
Are the top words of each noter, stripped of stop words, informative of what we are writing “about?” What can we infer about the sort of class the noters are taking? It looks pretty clear at this point that it’s a methods class. However, because we have multiple authors and multiple documents per author, something aggregative at the level of the whole corpus might be more useful.
So now, we turn to LDA topic modeling. I highly recommend that at this point you first go read this. If you’re feeling feisty, try the Wikipedia articles for Distributional Semantics and Latent Dirichlet Allocation.
Ok, all done? Now let’s proceed. I’m going to work with the “no stop word” version of the corpus, so first I need to reconstitute it back into one row per document.
notes_reconst_ns<-notes_unnested_ns %>%
group_by(noteid) %>% #one row per note
mutate(ind=row_number()) %>%
tidyr::spread(key=ind,value=word) # this creates a column for each word in each note
notes_reconst_ns[is.na(notes_reconst_ns)]<-"" #shorter notes leave NA cells; replace them with empty strings
notes_reconst_ns<-tidyr::unite(notes_reconst_ns,notes_ns,-c("noteid","noter"),sep=" ",remove=T) #stitch one-word columns together
print(notes_reconst_ns,n=7) #print first 7 rows
## # A tibble: 47 x 3
## # Groups: noteid [47]
## noteid noter notes_ns
## <chr> <chr> <chr>
## 1 04AB a3 "context free takeaway elements causal mechanisms hard measure r…
## 2 04AGR a2 "overview chapter quantitative qualitative approaches conceptual…
## 3 04AMB a7 "discussion paper offers guidelines challenges comparative analy…
## 4 04JN a6 "article authors propose measure residential segregation measure…
## 5 04JS a5 "threats construct validity issues construct validity occur lack…
## 6 04MK a4 "relationship condition related connection association connectio…
## 7 04YJL a1 "introduction conscious thinker person aware assumptions implica…
## # … with 40 more rows
Now, to create a document term matrix. This is a sparse matrix (mostly zeroes) with one row per document and one column per “term.” At its most basic, a term is a word. For our analysis, I’ll be using both words and bigrams (two-word combinations).
To illustrate, the phrase “the quick brown fox” consists of the terms the + quick + brown + fox + the quick + quick brown + brown fox. Note that we have stripped our text of stop words, so some of these bigrams may not accurately reflect actual usage: two surviving words get joined even if a stop word originally sat between them.
notes_ns_dtm <- CreateDtm(notes_reconst_ns$notes_ns,
doc_names = notes_reconst_ns$noteid,
ngram_window = c(1, 2))
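A quick sanity check on the result (a sketch; textmineR joins the words of a bigram with an underscore):
dim(notes_ns_dtm) #rows = documents, columns = terms
head(colnames(notes_ns_dtm)) #terms include both unigrams and underscore-joined bigrams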
Now, for an important step. LDA operates under the assumption that any corpus of documents can be described as a mixture of some pre-determined number of topics or themes, so each document can be described using a vector of topic probabilities. However, the length of that vector, the “correct” number of topics k, needs to be pre-specified. There are many ways of coming up with some “best” number of topics.
I choose to optimize for semantic coherence, a measure of how distinct the meaning of one topic or theme is from the other topics present in the corpus. Partly because this makes sense as something to optimize for topics, and partly because the developer of the textmineR package has spent a lot of time thinking about coherence and is very thoughtful in its implementation. We are looking for the number of topics that maximizes the average coherence of topics, so we need to search through some range of parameter space.
So where do we start? One way to do it, when you’re new to a type of corpus, is to start low, around 5 topics, and cast a wide net, all the way up to around 50 topics. Over time, you get a feel for the range of topics in a particular type of corpus, and then you can start with a narrower range. This is important, because the search process can be very time consuming for large corpora. You’re basically estimating an LDA topic model of your text with 5 topics, then 6 topics, then 7, and so on, and for large corpora, each estimation may take a lot of time. Here, I start at 40 topics because I know based on prior runs that sub-40 topics models have low coherence with this dataset.
coh<-data.frame(k=seq(40,60,by=1),coh=0) #a data frame to store the results of our search
for (i in 1:dim(coh)[1]) {
set.seed(12345)
notes_ns_lda<-FitLdaModel(dtm = notes_ns_dtm, k = coh[i,1], iterations = 500) #fit an LDA model with k topics
coh[i,2]<-mean(notes_ns_lda$coherence) #record the average topic coherence for this k
rm(notes_ns_lda)
}
plot(coh$k,coh$coh,type="l",xlab="Number of Topics",ylab="Average Coherence of Topics")
Looks like we have a winner. Now let’s look at a summary of the topics in our corpus.
k<-coh[coh$coh==max(coh$coh),1] #the k with the highest average coherence
set.seed(12345)
notes_ns_lda<-FitLdaModel(dtm = notes_ns_dtm, k = k, iterations = 5000) # more iterations for the "final" run
notes_ns_lda_sum<-SummarizeTopics(notes_ns_lda)
kable(notes_ns_lda_sum)
| topic | label_1 | prevalence | coherence | top_terms_phi | top_terms_gamma |
|---|---|---|---|---|---|
| t_1 | status_author | 1.50 | 0.297 | status, author, ietf, publication, al | ietf, publication, chair, signals, id |
| t_2 | treatment_control | 1.67 | 0.108 | cluster, means, variance, standard, treatment_control | cluster, clusters, fisher, variances, standard_error |
| t_3 | residential_segregation | 2.39 | 0.441 | segregation, measure, validity, residential, census | segregation, residential, households, residential_segregation, spatial |
| t_4 | grounded_theory | 3.16 | 0.210 | theory, theoretical, grounded, grounded_theory, strauss | grounded, strauss, grounded_theory, glaser_strauss, glaser |
| t_5 | concept_definition | 1.96 | 0.500 | concept, qual, indicators, quant, measurement | qual, indicators, quant, concept_definition, concept |
| t_6 | causal_inference | 2.26 | 0.235 | inference, causal_inference, causal, unit, units | causation, joint, exposure, autor, exposure_variables |
| t_7 | qualitative_researchers | 2.96 | 0.110 | researchers, qualitative, sampling, quantitative, studies | quantitative_researchers, unique, sampling_methods, statistically_representative, based_studies |
| t_8 | regression_discontinuity | 0.94 | 0.085 | regression, subjects, study, threshold, compliers | compliers, discontinuity_designs, crossover, units_assigned, fuzzy_regression |
| t_9 | retrospectively_consecrated | 1.04 | 0.970 | album, consecration, consecrated, music, critics | album, consecration, consecrated, music, professional |
| t_10 | treatment_control | 1.77 | 0.057 | treatment, construct, participants, constructs, control | participants, innovation, administrators, assess_presence, combining |
| t_11 | abandon_prematurely | 3.35 | 0.206 | causal, effect, outcome, potential, causal_effect | potential, causal_effect, causal_effects, outcome_treatment, outcomes_treatment |
| t_12 | trial_sample | 1.65 | 0.220 | randomization, sample, rct, treatment, trial | rct, trial, rcts, ate, perfect |
| t_13 | regression_discontinuity | 1.44 | 0.167 | treatment, assigned, control, units, regression_discontinuity | assigned_control, regression_discontinuity, instrumental_variables, assigned_treatment, instrumental |
| t_14 | sign_object | 1.75 | 0.621 | abduction, induction, deduction, observations, theories | abduction, deduction, induction, peirce, abductive |
| t_15 | selective_sampling | 2.22 | 0.187 | sampling, parameter, selective, populations, time | selective, adopters, selective_sampling, populations, beta |
| t_16 | field_experiments | 1.27 | 0.099 | experiments, field, subjects, lab, field_experiments | field_experiments, lab, social_experiments, substitutes, naturally |
| t_17 | construct_validity | 2.18 | 0.220 | construct, validity, construct_validity, constructs, match | construct_validity, match, particulars, prototypical, sampling_particulars |
| t_18 | free_spaces | 0.96 | 0.602 | change, spaces, reformers, advent, bayshore | spaces, reformers, advent, bayshore, defenders |
| t_19 | dependent_variable | 1.69 | 0.178 | variable, dependent, observations, dependent_variable, selection | dependent_variable, explanatory_variables, explanatory, dependent, indeterminate |
| t_20 | sampling_error | 1.14 | 0.397 | sna, analysis, lna, model, sample | sna, lna, mb, mb_sna, se |
| t_21 | target_population | 2.07 | 0.084 | population, sample, person, target, network | network, chain, target_population, seeds, target |
| t_22 | method_agreement | 2.42 | 0.088 | causal, independent, method, methods, variable | agreement, mill, method_agreement, method_difference, variable_independent |
| t_23 | selection_bias | 2.32 | 0.189 | bias, selection, selection_bias, outcome, endogenous_selection | selection_bias, endogenous_selection, bias, successful, offers |
| t_24 | wage_distribution | 1.01 | 0.285 | job, wage, change, time, workers | wage, retooling, job, skill, percentile |
| t_25 | silver_blaze | 2.01 | 0.216 | tests, test, evidence, analysis, hypothesis | process_tracing, tracing, tests, tannenwald, blaze |
| t_26 | slave_trades | 1.92 | 0.979 | slaves, slave, slave_trades, trades, africa | slaves, slave, slave_trades, trades, africa |
| t_27 | free_spaces | 0.89 | 0.481 | programs, residents, interns, oppositional, night | interns, oppositional, residents, programs, relational |
| t_28 | potential_outcome | 2.88 | 0.160 | treatment, individual, control, potential_outcome, population | catholic, yi_yi, constitutive, constitutive_features, effect_hospitalization |
| t_29 | abandon_prematurely | 0.89 | 0.192 | information, task, time, referrals_performance, pieces | referrals_performance, monitoring_treatment, turnover, job_referred, psa |
| t_30 | people_identify | 2.87 | 0.121 | context, people, ethnographic, develop, identify | audiences, fieldwork, financial, ethnographic, discerning |
| t_31 | potential_outcome | 2.43 | 0.036 | variables, researcher, sample, relationship, yi | block, block_door, computing, iv_dv, quasi |
| t_32 | gram_panchayats | 2.06 | 0.645 | women, gram, panchayats, gram_panchayats, reserved | gram, panchayats, gram_panchayats, reserved, council |
| t_33 | regulatory_compliance | 2.04 | 0.300 | compliance, organizational, organizations, research, regulation | compliance, legal, regulator, regulatory, regulatory_compliance |
| t_34 | civil_society | 2.15 | 0.600 | local, schlesinger, voluntary, civic, government | schlesinger, voluntary, civic, civil_society, classic |
| t_35 | healthcare_providers | 1.66 | 0.486 | bribery, baseline, intervention, threat, reputational | bribery, healthcare, healthcare_providers, providers, reputational_threat |
| t_36 | ladder_abstraction | 2.98 | 0.189 | empirical, conceptual, concepts, political, extension | abstraction, ladder_abstraction, universals, extension, relevance_potential |
| t_37 | observational_research | 2.71 | 0.131 | research, observational, experimental, observational_research, relationship | observational_research, radical, skepticism, experimental_research, radical_skepticism |
| t_38 | treatment_assignment | 2.74 | 0.279 | natural, treatment, experiments, natural_experiments, assignment | treatment_assignment, cpos_information, confounders, natural_experiments, causal_statistical |
| t_39 | directed_graphs | 2.38 | 0.156 | variable, causal, graphs, path, directed | conditioning, graph, directed, graphs, path |
| t_40 | abandon_prematurely | 2.38 | 0.099 | causal, process, qualitative, variable, research | causal_process, data_set, events, level_analysis, focus |
| t_41 | smoking_gun | 1.41 | 0.703 | cpos, gun, smoking_gun, observations, smoking | cpos, gun, smoking_gun, smoking, hoop_test |
| t_42 | referred_workers | 1.29 | 0.249 | referred, performance, workers, referrers, referred_workers | referred, referrers, referred_workers, referrals, referred_referred |
| t_43 | field_experiment | 1.00 | 0.291 | performance, privacy, data, organizational, observability | privacy, observability, organizational_learning, paradox, actions |
| t_44 | retrospectively_consecrated | 1.14 | 0.810 | albums, recognition, popular, film, cultural | popular, film, recognition, odds, albums |
| t_45 | surrogate_outcome | 2.13 | 0.351 | surrogate, outcomes, outcome, hba, people | surrogate, hba, bone, surrogate_outcome, clinical |
| t_46 | field_experiment | 1.01 | 0.210 | transparency, line, lines, study, operators | transparency, operators, shifts, visibility, curtain |
| t_47 | perceptions_quality | 2.56 | 0.144 | status, quality, effect, award, boost | perceptions, boost, citations, perceptions_quality, status_quality |
| t_48 | single_unit | 1.89 | 0.115 | study, unit, units, single, research | study_research, single_unit, causal_proposition, france, study_method |
| t_49 | diversity_policies | 2.07 | 0.583 | unionization, diversity, effects, policies, union | unionization, diversity, union, workplaces, policies |
| t_50 | conceptual_stretching | 2.05 | 0.146 | category, categories, attributes, central, conceptual | category, radial, mother, radial_categories, secondary_categories |
| t_51 | abandon_prematurely | 2.33 | 0.178 | treatment, experiment, effect, effects, data | post, conducted, reduce, constraints, dummy |
| t_52 | wage_distribution | 1.00 | 0.495 | plant, wages, workers, jobs, study | plant, jobs, hires, maintenance, wages |
First, the labels. These are machine-generated, high-probability bigrams associated with the topics. It’s easier to think of a topic in terms of a label than as a number (topic 1, topic 2). Think of this as the machine’s inductive open coding. I’m a big fan of recoding the topics in a way that makes sense to you. This may also be an occasion to assess how similar your interpretation of a topic is relative to others on your team.
Some definitions:
Prevalence: \(\Pr(\text{topic} \mid \text{corpus})\), or how “present” your topic is in the corpus. If I were to grab a document at random, how likely would I be to find this topic present? This column sums to 100.
Coherence: How distinct a topic is from all others. Ranges from 0 to 1. Roughly: \(\Pr(\text{top phi terms} \mid \text{topic}) \neq \Pr(\text{top phi terms} \mid \text{other topics})\).
Phi: \(\Pr(\text{term} \mid \text{topic})\). Top phi terms are those most central to the meaning of the topic.
Gamma: \(\Pr(\text{topic} \mid \text{term})\). Top gamma terms are those most exclusive to the topic.
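These quantities live on, or can be derived from, the fitted model object. A sketch, assuming textmineR’s conventions:
dim(notes_ns_lda$phi) #topics x terms: Pr(term | topic)
dim(notes_ns_lda$theta) #documents x topics: Pr(topic | document)
gam<-CalcGamma(phi=notes_ns_lda$phi,theta=notes_ns_lda$theta) #Pr(topic | term), derived from phi and theta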
Now, we probably have a slightly better sense of what type of methods class this is. Some notes about topic distributions:
-Highly coherent topics are often rare topics. When many authors in many notes are writing about the same topic, the natural variation in their writing styles and note contexts makes the topic diffuse. In our corpus, the highly coherent topics are often those specifically addressed in a single paper or chapter we were summarizing.
-High prevalence + low coherence topics are usually “corpus-level concerns” shared by all authors of a longitudinal multi-authored corpus. This being a methods class, these are usually issues of research design.
-Prevalence is related to the volume of text pertaining to a topic, so topics covered in lengthier notes are more likely to have high prevalence.
-Coherence is related to exclusivity of vocabulary. In highly coherent topics, there is overlap between the central (top phi) and exclusive (top gamma) vocabularies.
Now that we know what topics are present in the corpus, let’s take advantage of a most wonderful output of a topic model: the document topic matrix. It is a Very Good Matrix. It has one row per document (note), and each row is a vector of probabilities that each topic in the corpus is present in that document. The vector sums to 1, so it gives you the % presence of each topic in a note. You can do A LOT with this matrix, both qualitatively and quantitatively.
We can use this matrix to look at how topics are related to each other, and how they are distributed over time and authors. This is a way of analyzing discourse.
doc_top<-as.data.frame(notes_ns_lda$theta) # theta is the document topic matrix in the LDA object that we created above.
doc_top$noteid<-rownames(doc_top)
notes_dim<-notes[,c(1,2)]
notes_dim$week<-substr(notes_dim$noteid,1,2)
doc_top<-merge(notes_dim,doc_top) # adding dimensions like noter and week to each document
#To cluster topics on how likely they are mutually present in a note or a reading:
doc_top.mat<-t(as.matrix(doc_top[,4:ncol(doc_top)])) #transpose so each row is one topic's probability vector across documents
doc_top.dist<-JSD(doc_top.mat) #Jensen-Shannon distance between each pair of topic probability vectors
doc_top.hclust<-hclust(as.dist(doc_top.dist),method="ward.D") #hclusts are data objects about the hierarchical relationship between items.
plot(doc_top.hclust,labels=paste(notes_ns_lda_sum$topic,notes_ns_lda_sum$label_1),cex=0.65,main="Topics clustered on co-occurence in document")
Topics that are closer to each other are more likely to be mutually present in the same document. Note that these dendrograms can “rotate” along their vertical axis, like a mobile over a child’s crib.
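If you want discrete groups of topics rather than a dendrogram, you can also cut the tree; a sketch (the choice of 8 clusters here is arbitrary, purely for illustration):
topic_clusters<-cutree(doc_top.hclust,k=8) #assign each topic to one of 8 groups
split(paste(notes_ns_lda_sum$topic,notes_ns_lda_sum$label_1),topic_clusters) #list the topics in each group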
Now let’s look at how the prevalence of these topics is distributed across authors.
doc_top.l <- pivot_longer(doc_top,cols=starts_with("t_"),names_to="topic",values_to="prevalence") #converting the document topic matrix to a "tidy" form. This helps us aggregate the data quickly in various ways.
#annoying bug in textmineR - topics 1 through 9 don't have a leading 0, so they sort out of order
doc_top.l$topic<-sub("^t_(\\d)$","t_0\\1",doc_top.l$topic) #pad single-digit topic numbers to t_01 ... t_09
#now cast back to "wide" for a heatmap
topic_noter<- doc_top.l %>%
group_by(topic,noter) %>%
summarize(mp=mean(prevalence)) %>%
pivot_wider(values_from=mp,names_from=topic)
#preparing data for the Heatmap
topic_noter.m<-as.matrix(t(topic_noter[,2:ncol(topic_noter)]))
rownames(topic_noter.m)<-paste(notes_ns_lda_sum$topic,notes_ns_lda_sum$label_1)
colnames(topic_noter.m)<-topic_noter$noter
colorpalette <- colorRampPalette(brewer.pal(9, "Greens"))(256) # this sets the color range for the heatmap
heatmap(topic_noter.m,scale="col",cexCol=0.8,cexRow=0.7,col=colorpalette)
Note the hierarchical clustering of rows and columns. Noters who are closer to each other have produced more similar notes. Topics that are closer to each other are used similarly by the noters. You can already see a split between topics tied to the subject material of specific readings and general “class-level” concerns.
I wonder what this would look like if we had retained stop words. Would it look different?
Once we have a few more weeks of data, we can look at time effects. How does the unfolding of the semester affect things like the length of notes or the emphasis on particular topics? And can we identify the professor’s intent in assigning a particular set of readings to a given week?
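A sketch of one such cut, reusing doc_top.l (which already carries the week identifier):
topic_week<-doc_top.l %>%
group_by(week,topic) %>%
summarize(mp=mean(prevalence),.groups="drop") #average prevalence of each topic, by week of class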
To be continued…