This is a machine-assisted qualitative content analysis of class notes prepared by seven participants of a reading group for a Winter 2021 PhD Seminar at McGill University. Every week we each choose among the assigned readings and prepare summary notes for the rest of the group, to aid their reading. The texts of these notes are the raw data for this analysis.

Our stated goals are to use this analysis to explore content analysis methods and to look for semantic structure in our note-taking. This R vignette will capture the entire analysis pipeline. As with any halfway-decent machine-assisted qualitative analysis, it helps to have actually read the original texts.

For the lay reader: the nature of our PhD seminar, the intellectual proclivities of our professor, and our own linguistic tendencies will be made manifest in our analysis of the text.

#preamble
rm(list=ls()) #clear memory
setwd("~/Documents/OneDrive - McGill University/R/Projects/notes710") #working directory

#required packages
library(knitr) #publishing this notebook
library(gsheet) #accessing raw data from google sheets
library(tidyverse) #data manipulation
library(data.table) #data manipulation
library(tidytext) #text manipulation
library(textmineR) #topic modeling
library(philentropy) #clustering topics
library(ggplot2) #plot
library(RColorBrewer) #plot

The first step in this process is cleaning and arranging the raw data, the texts. I have pasted the text from the notes into a Google Sheet (sans formatting), one row per document. Conceptually, each reading summary written by each note-taker constitutes one “document” in our corpus. There is information in formatting: for example, whether an author prefers bullets or prose, or how they organize text into paragraphs, tells us something about them. In this analysis, however, we do not consider formatting.

An inspection of the data reveals that some of the readings are related, so we may expect notes about related readings to be more similar to each other than to other notes in some ways. We may also expect notes written by the same noter to be more similar in other ways to each other than notes written by different noters.

Now, import it all into R.

data1<-gsheet2tbl('docs.google.com/spreadsheets/d/1c2eWivCgCpU74-U0paQXfWAWNKo_O1euxua7JlXbJWo') #use the sharing/viewing url for the sheet here
print(data1,n=7) #print first 7 rows
## # A tibble: 48 x 7
##   noteid noter author       year title          selection  notes_noformat       
##   <chr>  <chr> <chr>       <dbl> <chr>          <chr>      <chr>                
## 1 04YJL  a1    Sartori, G.  1970 "Concept misf… 1033-1053  "Introduction A cons…
## 2 04AGR  a2    Goertz, G.…  2012 "Concepts, De… Ch10       "Overview The chapte…
## 3 04AB   a3    Prasad, V.…  2015 "Surrogate Ou… Ch3        "Context-Free Takeaw…
## 4 04MK   a4    Shadish, W…  2002 "Construct Va… pp. 64-82… "Relationship: The c…
## 5 04JS   a5    Shadish, W…  2002 "Construct Va… second 9   "Threats to Construc…
## 6 04JN   a6    Grigoryeva…  2015 "The historic… All        "In this article, th…
## 7 04AMB  a7    Collier, D…  1993 "Conceptual\"… All        "This discussion pap…
## # … with 41 more rows

Next, we tokenize this text. This means we’re going to transform the data into one row per word, which helps us do some basic counting and cleaning. We’ll reconstitute it back into one row per document before modeling. Note that you can choose to tokenize by bigram (two words per row), trigram (three words per row), line, sentence, paragraph, etc., depending on your unit of analysis. Because we disaggregate texts like this, it is critical that you have a unique identifier for each document. I like my unique identifiers to also be informative: here, noteid encodes the “week” of class and the author of the note.
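To illustrate the bigram option, here is a minimal sketch (not part of the main pipeline; we stick with single words below) that tokenizes the same raw data into two-word units:

#not run: tokenize the raw text into bigrams instead of single words
data1 %>%
  unnest_tokens(bigram,notes_noformat,token="ngrams",n=2) %>%
  select(noteid,noter,bigram) %>% #keep only the id columns and the bigrams
  head()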

This “only one thing per row” concept is core to the tidyverse. We’ll be using a lot of Tidy methods here. Tidy is a disciplined way of thinking about data architecture; it supports a relatively simple but versatile programming syntax that lets us ask a huge number of empirical research questions. We will, of course, frequently make the data unTidy for various practical reasons.

This is what the data look like after tokenization. All punctuation and capitalization are stripped out.

notes<-data1[,c(1,2,7)] # I only want NoteID, Noter, and Text
notes_unnested<-notes %>%
  unnest_tokens(word,notes_noformat) #unnest column notes_noformat into words.
print(notes_unnested,n=15)
## # A tibble: 52,587 x 3
##    noteid noter word        
##    <chr>  <chr> <chr>       
##  1 04YJL  a1    introduction
##  2 04YJL  a1    a           
##  3 04YJL  a1    conscious   
##  4 04YJL  a1    thinker     
##  5 04YJL  a1    is          
##  6 04YJL  a1    a           
##  7 04YJL  a1    person      
##  8 04YJL  a1    who         
##  9 04YJL  a1    is          
## 10 04YJL  a1    aware       
## 11 04YJL  a1    of          
## 12 04YJL  a1    the         
## 13 04YJL  a1    assumptions 
## 14 04YJL  a1    and         
## 15 04YJL  a1    implications
## # … with 52,572 more rows

Note the information around the data. This is a “tibble” with a stated number of rows × columns. Since this is a one-word-per-row table, the row count of notes_unnested is the total number of words in our corpus.
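If you would rather compute that count than read it off the tibble header, a one-liner does it:

nrow(notes_unnested) #total number of word tokens in the corpus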

For the number of unique words in the corpus, look at the row count of the notes_words table.

notes_words<-notes_unnested %>% # in the notes_unnested table,
  count(word,sort=TRUE) # count how many times each word appears

print(notes_words,n=15)
## # A tibble: 5,938 x 2
##    word          n
##    <chr>     <int>
##  1 the        3212
##  2 of         2048
##  3 to         1441
##  4 and        1180
##  5 a          1015
##  6 in          990
##  7 is          743
##  8 that        591
##  9 for         487
## 10 are         418
## 11 on          391
## 12 be          359
## 13 with        347
## 14 not         310
## 15 treatment   304
## # … with 5,923 more rows

Now we can start looking at things like most frequent words by noter.

n<-10
noter_topw<-notes_unnested %>% #in the notes_unnested table,
  count(noter,word,sort=TRUE) %>% #count unique noter-word combinations,
  group_by(noter) %>% 
  top_n(n) %>% #and show me the top "n" words used by each noter
  arrange(noter)

#make the table "wide" so it's easier to read.
noter_topw_w<-data.frame(matrix(0,nrow=n,ncol=length(unique(noter_topw$noter))))
for(i in 1:length(unique(noter_topw$noter))) {
  noter_topw_w[,i]<-head(noter_topw[noter_topw$noter==unique(noter_topw$noter)[i],2],n) #the top-n words of noter i become column i
}
names(noter_topw_w)<-unique(noter_topw$noter)
head(noter_topw_w,n=n)
## # A tibble: 10 x 7
##    a1     a2    a3    a4    a5    a6    a7   
##    <chr>  <chr> <chr> <chr> <chr> <chr> <chr>
##  1 the    of    the   the   the   the   the  
##  2 of     the   of    of    of    of    of   
##  3 to     and   a     to    to    to    to   
##  4 and    to    to    and   and   a     a    
##  5 a      a     is    in    in    is    and  
##  6 in     in    and   a     a     in    in   
##  7 is     is    in    for   that  and   is   
##  8 are    that  that  is    is    that  that 
##  9 be     with  for   that  for   case  are  
## 10 causal on    or    on    be    study p

You’ll notice a lot of words like “the” and “of”. Let’s remove them. We’ll use a pre-existing lexicon of “stop words” to remove them from our corpus, and redo the top-word exercise. Some of our semantic “personality” is likely reflected in how we use these words, so we lose some of that information when we remove stop words.

data(stop_words) #load lexicon
notes_unnested_ns <- notes_unnested %>%
  anti_join(stop_words)

n<-10
noter_topw_ns<-notes_unnested_ns %>%
  count(noter,word,sort=TRUE) %>%
  group_by(noter) %>%
  top_n(n) %>% #top "n" words per noter, now without stop words
  arrange(noter)

noter_topw_w_ns<-data.frame(matrix(0,nrow=n,ncol=length(unique(noter_topw_ns$noter))))
for(i in 1:length(unique(noter_topw_ns$noter))) {
  noter_topw_w_ns[,i]<-head(noter_topw_ns[noter_topw_ns$noter==unique(noter_topw_ns$noter)[i],2],n)
}
names(noter_topw_w_ns)<-unique(noter_topw_ns$noter)
head(noter_topw_w_ns,n=n)
## # A tibble: 10 x 7
##    a1          a2            a3           a4       a5       a6         a7       
##    <chr>       <chr>         <chr>        <chr>    <chr>    <chr>      <chr>    
##  1 causal      status        bias         treatme… treatme… study      experime…
##  2 treatment   concept       causal       causal   referred treatment  field    
##  3 selection   unionization  research     control  workers  bribery    category 
##  4 slaves      research      outcome      analysis 2        research   categori…
##  5 variable    data          variable     albums   job      unit       subjects 
##  6 theory      effects       effect       effect   analysis causal     populati…
##  7 research    qual          treatment    assigned causal   baseline   effect   
##  8 slave       ietf          selection    study    control  intervent… experime…
##  9 observatio… organization… observation… data     model    researche… e.g      
## 10 variables   quality       outcomes     outcome  sample   i.e        1

Note: cleaning your analysis corpus “automagically” has some limitations. Words like “cases” and “different” get cleaned out, and I’m not sure I agree that these are meaningless “filler” words. But the alternative is to manually remove stop words one by one, and that can get very tedious. So “anti-joining” is an easy option. Anti-joining A and B means removing all elements of \(A \cap B\) from A.
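A toy example of that anti-join logic, with made-up mini-tables just for illustration:

#toy illustration: anti_join keeps the rows of A whose "word" does not appear in B
A<-tibble(word=c("cases","different","the","of"))
B<-tibble(word=c("the","of")) #a miniature stop word list
anti_join(A,B,by="word") #returns "cases" and "different"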

Let’s also check what the stop word removal did to the size of our corpus.

#wordcount
dim(notes_unnested_ns)[1] #number of rows. For columns, [2]
## [1] 26149
#unique words
notes_words_ns<-notes_unnested_ns %>% 
  count(word,sort=TRUE) 
dim(notes_words_ns)[1] 
## [1] 5413

Are the top words of each noter, stripped of stop words, informative of what we are writing “about”? What can we infer about the sort of class the noters are taking? It looks pretty clear at this point that it’s a methods class. However, because we have multiple authors and multiple documents per author, something aggregative at the level of the whole corpus might be more useful.

So now, we turn to LDA topic modeling. I highly recommend at this point you first go read this. If you’re feeling feisty, try the Wikipedia articles for Distributional Semantics and Latent Dirichlet Allocation.

Ok, all done? Now let’s proceed. I’m going to work with the “no stop word” version of the corpus, so first I need to reconstitute it back into one row per document.

notes_reconst_ns<-notes_unnested_ns %>%
  group_by(noteid) %>% #one row per note
  mutate(ind=row_number()) %>%
  tidyr::spread(key=ind,value=word) #this creates a column for each word position in each note
notes_reconst_ns[is.na(notes_reconst_ns)]<-""
notes_reconst_ns<-tidyr::unite(notes_reconst_ns,notes_ns,-c("noteid","noter"),sep=" ",remove=T) #stitch the one-word columns back together

print(notes_reconst_ns,n=7) #print first 7 rows
## # A tibble: 47 x 3
## # Groups:   noteid [47]
##   noteid noter notes_ns                                                         
##   <chr>  <chr> <chr>                                                            
## 1 04AB   a3    "context free takeaway elements causal mechanisms hard measure r…
## 2 04AGR  a2    "overview chapter quantitative qualitative approaches conceptual…
## 3 04AMB  a7    "discussion paper offers guidelines challenges comparative analy…
## 4 04JN   a6    "article authors propose measure residential segregation measure…
## 5 04JS   a5    "threats construct validity issues construct validity occur lack…
## 6 04MK   a4    "relationship condition related connection association connectio…
## 7 04YJL  a1    "introduction conscious thinker person aware assumptions implica…
## # … with 40 more rows

Now, to create a document term matrix. This is a sparse matrix (mostly zeros) with one row per document and one column per “term.” At its most basic, a term is a word. For our analysis, I’ll be using both words and bigrams (two-word combinations).

To illustrate, the phrase “the quick brown fox” consists of the terms the + quick + brown + fox + the quick + quick brown + brown fox. Note that we have stripped our text of stopwords, so some of these bigrams may not accurately reflect their actual use.

notes_ns_dtm <- CreateDtm(notes_reconst_ns$notes_ns, 
                    doc_names = notes_reconst_ns$noteid, 
                    ngram_window = c(1, 2))
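Before modeling, it can be worth a quick glance at the size and sparsity of this matrix. A minimal check (not required for the pipeline; Matrix is a dependency of textmineR):

dim(notes_ns_dtm) #one row per note, one column per unigram or bigram term
head(colnames(notes_ns_dtm)) #a sample of the terms
1-Matrix::nnzero(notes_ns_dtm)/prod(dim(notes_ns_dtm)) #share of cells that are zero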

Now, for an important step. LDA operates under the assumption that any corpus of documents can be described as a mixture of some pre-determined number of topics or themes, so each document can be described by a vector of topic probabilities. However, the length of that vector, the “correct” number of topics k, needs to be pre-specified. There are many ways of coming up with some “best” number of topics.

I choose to optimize for semantic coherence, a measure of how distinct the meaning of one topic or theme is from the other topics present in the corpus. Partly because this makes sense as something to optimize for topics, and partly because the developer of the textmineR package has spent a lot of time thinking about coherence and is very thoughtful in its implementation. We are looking for the number of topics that maximizes the average coherence of topics, so we need to search through some range of the parameter space.

So where do we start? One way to do it, when you’re new to a type of corpus, is to start low, around 5 topics, and cast a wide net, all the way up to around 50 topics. Over time, you get a feel for the range of topics in a particular type of corpus, and then you can start with a narrower range. This is important, because the search process can be very time consuming for large corpora. You’re basically estimating an LDA topic model of your text with 5 topics, then 6 topics, then 7, and so on, and for large corpora, each estimation may take a lot of time. Here, I start at 40 topics because I know based on prior runs that sub-40 topics models have low coherence with this dataset.

#a data frame to store the results of our search: one row per candidate k
coh<-data.frame(k=seq(40,60,by=1),coh=0)

for (i in 1:dim(coh)[1]) {
  set.seed(12345)
  notes_ns_lda<-FitLdaModel(dtm = notes_ns_dtm, k = coh[i,1], iterations = 500) #fit an LDA model with k topics
  coh[i,2]<-mean(notes_ns_lda$coherence) #record the average topic coherence for this k
  rm(notes_ns_lda)
}

plot(coh$k,coh$coh,type="l",xlab="Number of Topics",ylab="Average Coherence of Topics")

Looks like we have a winner. Now let’s look at a summary of the topics in our corpus.

k<-coh[coh$coh==max(coh$coh),1] #the k with the highest average coherence
set.seed(12345)
notes_ns_lda<-FitLdaModel(dtm = notes_ns_dtm, k = k, iterations = 5000) # more iterations for the "final" run
notes_ns_lda_sum<-SummarizeTopics(notes_ns_lda)
kable(notes_ns_lda_sum)
| topic | label_1 | prevalence | coherence | top_terms_phi | top_terms_gamma |
|---|---|---|---|---|---|
| t_1 | status_author | 1.50 | 0.297 | status, author, ietf, publication, al | ietf, publication, chair, signals, id |
| t_2 | treatment_control | 1.67 | 0.108 | cluster, means, variance, standard, treatment_control | cluster, clusters, fisher, variances, standard_error |
| t_3 | residential_segregation | 2.39 | 0.441 | segregation, measure, validity, residential, census | segregation, residential, households, residential_segregation, spatial |
| t_4 | grounded_theory | 3.16 | 0.210 | theory, theoretical, grounded, grounded_theory, strauss | grounded, strauss, grounded_theory, glaser_strauss, glaser |
| t_5 | concept_definition | 1.96 | 0.500 | concept, qual, indicators, quant, measurement | qual, indicators, quant, concept_definition, concept |
| t_6 | causal_inference | 2.26 | 0.235 | inference, causal_inference, causal, unit, units | causation, joint, exposure, autor, exposure_variables |
| t_7 | qualitative_researchers | 2.96 | 0.110 | researchers, qualitative, sampling, quantitative, studies | quantitative_researchers, unique, sampling_methods, statistically_representative, based_studies |
| t_8 | regression_discontinuity | 0.94 | 0.085 | regression, subjects, study, threshold, compliers | compliers, discontinuity_designs, crossover, units_assigned, fuzzy_regression |
| t_9 | retrospectively_consecrated | 1.04 | 0.970 | album, consecration, consecrated, music, critics | album, consecration, consecrated, music, professional |
| t_10 | treatment_control | 1.77 | 0.057 | treatment, construct, participants, constructs, control | participants, innovation, administrators, assess_presence, combining |
| t_11 | abandon_prematurely | 3.35 | 0.206 | causal, effect, outcome, potential, causal_effect | potential, causal_effect, causal_effects, outcome_treatment, outcomes_treatment |
| t_12 | trial_sample | 1.65 | 0.220 | randomization, sample, rct, treatment, trial | rct, trial, rcts, ate, perfect |
| t_13 | regression_discontinuity | 1.44 | 0.167 | treatment, assigned, control, units, regression_discontinuity | assigned_control, regression_discontinuity, instrumental_variables, assigned_treatment, instrumental |
| t_14 | sign_object | 1.75 | 0.621 | abduction, induction, deduction, observations, theories | abduction, deduction, induction, peirce, abductive |
| t_15 | selective_sampling | 2.22 | 0.187 | sampling, parameter, selective, populations, time | selective, adopters, selective_sampling, populations, beta |
| t_16 | field_experiments | 1.27 | 0.099 | experiments, field, subjects, lab, field_experiments | field_experiments, lab, social_experiments, substitutes, naturally |
| t_17 | construct_validity | 2.18 | 0.220 | construct, validity, construct_validity, constructs, match | construct_validity, match, particulars, prototypical, sampling_particulars |
| t_18 | free_spaces | 0.96 | 0.602 | change, spaces, reformers, advent, bayshore | spaces, reformers, advent, bayshore, defenders |
| t_19 | dependent_variable | 1.69 | 0.178 | variable, dependent, observations, dependent_variable, selection | dependent_variable, explanatory_variables, explanatory, dependent, indeterminate |
| t_20 | sampling_error | 1.14 | 0.397 | sna, analysis, lna, model, sample | sna, lna, mb, mb_sna, se |
| t_21 | target_population | 2.07 | 0.084 | population, sample, person, target, network | network, chain, target_population, seeds, target |
| t_22 | method_agreement | 2.42 | 0.088 | causal, independent, method, methods, variable | agreement, mill, method_agreement, method_difference, variable_independent |
| t_23 | selection_bias | 2.32 | 0.189 | bias, selection, selection_bias, outcome, endogenous_selection | selection_bias, endogenous_selection, bias, successful, offers |
| t_24 | wage_distribution | 1.01 | 0.285 | job, wage, change, time, workers | wage, retooling, job, skill, percentile |
| t_25 | silver_blaze | 2.01 | 0.216 | tests, test, evidence, analysis, hypothesis | process_tracing, tracing, tests, tannenwald, blaze |
| t_26 | slave_trades | 1.92 | 0.979 | slaves, slave, slave_trades, trades, africa | slaves, slave, slave_trades, trades, africa |
| t_27 | free_spaces | 0.89 | 0.481 | programs, residents, interns, oppositional, night | interns, oppositional, residents, programs, relational |
| t_28 | potential_outcome | 2.88 | 0.160 | treatment, individual, control, potential_outcome, population | catholic, yi_yi, constitutive, constitutive_features, effect_hospitalization |
| t_29 | abandon_prematurely | 0.89 | 0.192 | information, task, time, referrals_performance, pieces | referrals_performance, monitoring_treatment, turnover, job_referred, psa |
| t_30 | people_identify | 2.87 | 0.121 | context, people, ethnographic, develop, identify | audiences, fieldwork, financial, ethnographic, discerning |
| t_31 | potential_outcome | 2.43 | 0.036 | variables, researcher, sample, relationship, yi | block, block_door, computing, iv_dv, quasi |
| t_32 | gram_panchayats | 2.06 | 0.645 | women, gram, panchayats, gram_panchayats, reserved | gram, panchayats, gram_panchayats, reserved, council |
| t_33 | regulatory_compliance | 2.04 | 0.300 | compliance, organizational, organizations, research, regulation | compliance, legal, regulator, regulatory, regulatory_compliance |
| t_34 | civil_society | 2.15 | 0.600 | local, schlesinger, voluntary, civic, government | schlesinger, voluntary, civic, civil_society, classic |
| t_35 | healthcare_providers | 1.66 | 0.486 | bribery, baseline, intervention, threat, reputational | bribery, healthcare, healthcare_providers, providers, reputational_threat |
| t_36 | ladder_abstraction | 2.98 | 0.189 | empirical, conceptual, concepts, political, extension | abstraction, ladder_abstraction, universals, extension, relevance_potential |
| t_37 | observational_research | 2.71 | 0.131 | research, observational, experimental, observational_research, relationship | observational_research, radical, skepticism, experimental_research, radical_skepticism |
| t_38 | treatment_assignment | 2.74 | 0.279 | natural, treatment, experiments, natural_experiments, assignment | treatment_assignment, cpos_information, confounders, natural_experiments, causal_statistical |
| t_39 | directed_graphs | 2.38 | 0.156 | variable, causal, graphs, path, directed | conditioning, graph, directed, graphs, path |
| t_40 | abandon_prematurely | 2.38 | 0.099 | causal, process, qualitative, variable, research | causal_process, data_set, events, level_analysis, focus |
| t_41 | smoking_gun | 1.41 | 0.703 | cpos, gun, smoking_gun, observations, smoking | cpos, gun, smoking_gun, smoking, hoop_test |
| t_42 | referred_workers | 1.29 | 0.249 | referred, performance, workers, referrers, referred_workers | referred, referrers, referred_workers, referrals, referred_referred |
| t_43 | field_experiment | 1.00 | 0.291 | performance, privacy, data, organizational, observability | privacy, observability, organizational_learning, paradox, actions |
| t_44 | retrospectively_consecrated | 1.14 | 0.810 | albums, recognition, popular, film, cultural | popular, film, recognition, odds, albums |
| t_45 | surrogate_outcome | 2.13 | 0.351 | surrogate, outcomes, outcome, hba, people | surrogate, hba, bone, surrogate_outcome, clinical |
| t_46 | field_experiment | 1.01 | 0.210 | transparency, line, lines, study, operators | transparency, operators, shifts, visibility, curtain |
| t_47 | perceptions_quality | 2.56 | 0.144 | status, quality, effect, award, boost | perceptions, boost, citations, perceptions_quality, status_quality |
| t_48 | single_unit | 1.89 | 0.115 | study, unit, units, single, research | study_research, single_unit, causal_proposition, france, study_method |
| t_49 | diversity_policies | 2.07 | 0.583 | unionization, diversity, effects, policies, union | unionization, diversity, union, workplaces, policies |
| t_50 | conceptual_stretching | 2.05 | 0.146 | category, categories, attributes, central, conceptual | category, radial, mother, radial_categories, secondary_categories |
| t_51 | abandon_prematurely | 2.33 | 0.178 | treatment, experiment, effect, effects, data | post, conducted, reduce, constraints, dummy |
| t_52 | wage_distribution | 1.00 | 0.495 | plant, wages, workers, jobs, study | plant, jobs, hires, maintenance, wages |

First, the labels. These are machine-generated, high-probability bigrams associated with the topics. It’s easier to think of a topic in terms of a label than as a number (topic 1, topic 2). Think of this as the machine’s inductive open coding. I’m a big fan of recoding the topics in a way that makes sense to you. This may also be an occasion to assess how similar your interpretation of a topic is to that of others on your team.
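Recoding can be as simple as copying the machine labels into a new column and overwriting the ones you read differently. The relabels below are hypothetical, purely for illustration:

#start from the machine-generated labels, then override where your reading differs
notes_ns_lda_sum$label_manual<-notes_ns_lda_sum$label_1
notes_ns_lda_sum$label_manual[notes_ns_lda_sum$topic=="t_26"]<-"historical slave trades"
notes_ns_lda_sum$label_manual[notes_ns_lda_sum$topic=="t_41"]<-"process tracing tests"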

Some definitions:

Prevalence: Pr(topic | corpus), or how “present” your topic is in the corpus. If I were to grab a document at random, how likely would I be to find this topic present? This column sums to 100.

Coherence: how distinct a topic is from all the others. Ranges from 0 to 1. Roughly: \(\Pr(\text{top } \phi \text{ terms} \mid \text{topic}) \neq \Pr(\text{top } \phi \text{ terms} \mid \text{other topics})\).

Phi: Pr(term | topic). Top terms are those most central to the meaning of the topic.

Gamma: Pr(topic | term). Top terms are those most exclusive to the topic.
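If you want to poke at these quantities yourself, the model and summary objects expose them directly; a short sketch using textmineR’s GetTopTerms and CalcGamma helpers:

sum(notes_ns_lda_sum$prevalence) #prevalence should sum to (approximately) 100
GetTopTerms(phi=notes_ns_lda$phi,M=5)[,1:3] #top 5 phi terms for the first three topics
gamma<-CalcGamma(phi=notes_ns_lda$phi,theta=notes_ns_lda$theta) #Pr(topic | term), the basis of the gamma top terms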

Now, we probably have a slightly better sense of what type of methods class this is. Some notes about topic distributions:

-Highly coherent topics are often rare topics. When many authors in many notes are writing about the same topic, the natural variation in their writing styles and note contexts makes the topic diffuse. In our corpus, highly coherent topics are often ones specifically addressed in a single paper or chapter we were summarizing.

-High prevalence + low coherence topics are usually “corpus-level concerns” shared by all authors of a longitudinal, multi-authored corpus. This being a methods class, these are usually issues of research design.

-Prevalence is related to the volume of text pertaining to a topic, so topics covered in lengthier notes are more likely to have high prevalence.

-Coherence is related to exclusivity of vocabulary: in highly coherent topics, there is overlap between the central (phi) and exclusive (gamma) vocabularies.
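One quick way to eyeball these patterns is to plot prevalence against coherence for all topics; a simple diagnostic, nothing formal:

plot(notes_ns_lda_sum$prevalence,notes_ns_lda_sum$coherence,
     xlab="Prevalence",ylab="Coherence",main="Topic prevalence vs. coherence")
text(notes_ns_lda_sum$prevalence,notes_ns_lda_sum$coherence,
     labels=notes_ns_lda_sum$topic,pos=3,cex=0.6) #label each point with its topic id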

Now that we know what topics are present in the corpus, let’s take advantage of a most wonderful output of a topic model: the document topic matrix. It is a Very Good Matrix. It has one row per document (note), and each row is a vector of probabilities that each of the topics in the corpus is present in that document. The vector sums to 1, so it gives you the % presence of each topic in a note. You can do A LOT with this matrix, both qualitatively and quantitatively.
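A quick sanity check that each note’s topic vector really behaves like a probability distribution:

summary(rowSums(notes_ns_lda$theta)) #each row of theta should sum to (approximately) 1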

We can use this matrix to look at how topics are related to each other, and how they are distributed over time and across authors. This is a way of analyzing discourse.

doc_top<-as.data.frame(notes_ns_lda$theta) # theta is the document topic matrix in the LDA object that we created above.
doc_top$noteid<-rownames(doc_top)
notes_dim<-notes[,c(1,2)]
notes_dim$week<-substr(notes_dim$noteid,1,2)
doc_top<-merge(notes_dim,doc_top) # adding dimensions like noter and week to each document

#To cluster topics on how likely they are to be mutually present in a note or a reading:
doc_top.mat<-t(as.matrix(doc_top[,4:ncol(doc_top)])) #one row per topic, one column per note
doc_top.dist<-JSD(doc_top.mat) #Jensen-Shannon distance between the topic probability vectors
doc_top.hclust<-hclust(as.dist(doc_top.dist),method="ward.D") #hclusts are data objects about the hierarchical relationship between items
plot(doc_top.hclust,labels=paste(notes_ns_lda_sum$topic,notes_ns_lda_sum$label_1),cex=0.65,main="Topics clustered on co-occurrence in documents")

Topics that are closer to each other are more likely to be mutually present in the same document. Note that these dendrograms can “rotate” along their vertical axis, like a mobile over a child’s crib.
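If you prefer discrete groupings to a dendrogram, you can cut the tree at a chosen number of clusters. The six clusters below are an arbitrary choice for illustration; pick a cut that makes sense from the plot:

topic_clusters<-cutree(doc_top.hclust,k=6) #assign each topic to one of 6 clusters
split(paste(notes_ns_lda_sum$topic,notes_ns_lda_sum$label_1),topic_clusters) #list topics by cluster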

Now let’s look at the distribution of prevalence of these topics across authors.

doc_top.l <- pivot_longer(doc_top,cols=starts_with("t_"),names_to="topic",values_to="prevalence") #converting the document topic matrix to a "tidy" form. This helps us aggregate the data quickly in various ways.

#quirk in textmineR's topic names: topics 1 through 9 lack a leading zero, which breaks alphabetical sorting
doc_top.l$topic<-sub("^t_(\\d)$","t_0\\1",doc_top.l$topic) #t_1 becomes t_01; t_10 and above are untouched

#now cast back to "wide" for a heatmap
topic_noter<-  doc_top.l %>%
  group_by(topic,noter) %>%
  summarize(mp=mean(prevalence)) %>%
  pivot_wider(values_from=mp,names_from=topic)

#preparing data for the Heatmap
topic_noter.m<-as.matrix(t(topic_noter[,2:ncol(topic_noter)]))
rownames(topic_noter.m)<-paste(notes_ns_lda_sum$topic,notes_ns_lda_sum$label_1)
colnames(topic_noter.m)<-topic_noter$noter
colorpalette <- colorRampPalette(brewer.pal(9, "Greens"))(256) # this sets the color range for the heatmap

heatmap(topic_noter.m,scale="col",cexCol=0.8,cexRow=0.7,col=colorpalette)

Note the hierarchical clustering of rows and columns. Noters who are closer to each other have produced more similar notes. Topics that are closer to each other are used similarly by the noters. You can already see topics tied to the subject material of specific readings, and more general “class-level” concerns.

I wonder what this would look like if we had retained stop words. Would it look different?

Once we have a few more weeks of data we can look at time effects. How does the unfolding of the semester affect things like note length or emphasis on particular topics, and can we identify the intent of the professor in assigning a particular set of readings to a week?

To be continued…