This is a machine-assisted qualitative content analysis of class notes prepared by seven participants of a reading group for a Winter 2021 PhD Seminar at McGill University. Every week we each choose among the assigned readings and prepare summary notes for the rest of the group, to aid their reading. The texts of these notes are the raw data for this analysis.
Our stated goals are to use this analysis to explore content analysis methods and to look for semantic structure in our note-taking. This R vignette will capture the entire analysis pipeline. As with any halfway-decent machine-assisted qualitative analysis, it helps to have actually read the original texts.
For the lay reader: the nature of our PhD seminar, the intellectual proclivities of our professor, and our own linguistic tendencies will be made manifest in our analysis of the text.
#preamble
rm(list=ls()) #clear memory
setwd("~/Documents/OneDrive - McGill University/R/Projects/notes710") #working directory
#required packages
library(knitr) #publishing this notebook
library(gsheet) #accessing raw data from google sheets
library(tidyverse) #data manipulation
library(data.table) #data manipulation
library(tidytext) #text manipulation
library(textmineR) #topic modeling
library(philentropy) #clustering topics
library(ggplot2) #plot
library(RColorBrewer) #plot
The first step in this process is cleaning and arranging the raw data, the texts. I have pasted the text from the notes into a google sheet (sans formatting), one row per document. Conceptually, each reading summary written by each note-taker constitutes one “document” in our corpus. There is information in formatting. For example, whether an author prefers to use bullets or prose, or how they organize text into paragraphs tells us something about them. In this analysis, however, we do not consider formatting.
An inspection of the data reveals that some of the readings are related, so we may expect notes about related readings to be more similar to each other than to other notes in some ways. We may also expect notes written by the same noter to be more similar in other ways to each other than notes written by different noters.
Now, import it all into R.
data1<-gsheet2tbl('docs.google.com/spreadsheets/d/1c2eWivCgCpU74-U0paQXfWAWNKo_O1euxua7JlXbJWo') #use the sharing/viewing url for the sheet here
print(data1,n=7) #print first 7 rows
## # A tibble: 48 x 7
## noteid noter author year title selection notes_noformat
## <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 04YJL a1 Sartori, G. 1970 "Concept misf… 1033-1053 "Introduction A cons…
## 2 04AGR a2 Goertz, G.… 2012 "Concepts, De… Ch10 "Overview The chapte…
## 3 04AB a3 Prasad, V.… 2015 "Surrogate Ou… Ch3 "Context-Free Takeaw…
## 4 04MK a4 Shadish, W… 2002 "Construct Va… pp. 64-82… "Relationship: The c…
## 5 04JS a5 Shadish, W… 2002 "Construct Va… second 9 "Threats to Construc…
## 6 04JN a6 Grigoryeva… 2015 "The historic… All "In this article, th…
## 7 04AMB a7 Collier, D… 1993 "Conceptual\"… All "This discussion pap…
## # … with 41 more rows
Next, we tokenize this text. This means we’re going to transform the data into one row per word. This helps us do some basic counting and cleaning. We’ll reconstitute it back into one row per document before modeling it. Note that you can choose to tokenize by bigram (two words per row), trigram (three words per row), line, sentence, paragraph, etc., depending on your unit of analysis. Because we disaggregate texts like this, it is critical that you have a unique identifier for each document. I like my unique identifiers to also be informative. Here, noteid is informative of the “week” of class and the author of the note.
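For instance, a minimal sketch of tokenizing by bigram instead of by single word (not part of this pipeline; notes_bigrams is just an illustrative name):
notes_bigrams<-data1 %>%
unnest_tokens(bigram,notes_noformat,token="ngrams",n=2) #one two-word combination per row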
This “only one thing per row” concept is core to the tidyverse. We’ll be using a lot of tidy methods here. Tidy is a disciplined way of thinking about data architecture that facilitates a relatively simple but versatile programming syntax, which lets us ask a huge number of empirical research questions. We will, of course, frequently make the data unTidy for various practical reasons.
This is what the data look like now. All punctuation and capitalization have been stripped out.
notes<-data1[,c(1,2,7)] #keep only noteid, noter, and notes_noformat
notes_unnested<-notes %>%
unnest_tokens(word,notes_noformat) #unnest column notes_noformat into words.
print(notes_unnested,n=15)
## # A tibble: 52,587 x 3
## noteid noter word
## <chr> <chr> <chr>
## 1 04YJL a1 introduction
## 2 04YJL a1 a
## 3 04YJL a1 conscious
## 4 04YJL a1 thinker
## 5 04YJL a1 is
## 6 04YJL a1 a
## 7 04YJL a1 person
## 8 04YJL a1 who
## 9 04YJL a1 is
## 10 04YJL a1 aware
## 11 04YJL a1 of
## 12 04YJL a1 the
## 13 04YJL a1 assumptions
## 14 04YJL a1 and
## 15 04YJL a1 implications
## # … with 52,572 more rows
Note the information around the data. This is a “tibble” with a certain number of rows x columns. Since this is a one-word-per-row table, the 52,587 rows of notes_unnested mean there are 52,587 words in our corpus.
For the number of unique words in the corpus, look at the row count of the notes_words table below: 5,938.
notes_words<-notes_unnested %>% # in the notes_unnested table,
count(word,sort=TRUE) # count how many times each word appears
print(notes_words,n=15)
## # A tibble: 5,938 x 2
## word n
## <chr> <int>
## 1 the 3212
## 2 of 2048
## 3 to 1441
## 4 and 1180
## 5 a 1015
## 6 in 990
## 7 is 743
## 8 that 591
## 9 for 487
## 10 are 418
## 11 on 391
## 12 be 359
## 13 with 347
## 14 not 310
## 15 treatment 304
## # … with 5,923 more rows
Now we can start looking at things like most frequent words by noter.
n<-10
noter_topw<-notes_unnested %>% #in the notes_unnested table,
count(noter,word,sort=TRUE) %>% #count unique noter-word combinations,
group_by(noter) %>%
top_n(n) %>% #and show me the top n words used by each noter
arrange(noter)
#make the table "wide" so it's easier to read
noter_topw_w<-data.frame(matrix(0,nrow=n,ncol=length(unique(noter_topw$noter))))
for (i in 1:length(unique(noter_topw$noter))) {
noter_topw_w[,i]<-head(noter_topw[noter_topw$noter==unique(noter_topw$noter)[i],2],n) #head() trims any ties that top_n() kept beyond n
}
names(noter_topw_w)<-unique(noter_topw$noter)
head(noter_topw_w,n=n)
## # A tibble: 10 x 7
## a1 a2 a3 a4 a5 a6 a7
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 the of the the the the the
## 2 of the of of of of of
## 3 to and a to to to to
## 4 and to to and and a a
## 5 a a is in in is and
## 6 in in and a a in in
## 7 is is in for that and is
## 8 are that that is is that that
## 9 be with for that for case are
## 10 causal on or on be study p
You’ll notice a lot of words like “the” and “of”. Let’s remove them. We’ll use a pre-existing lexicon of “stop words” to remove them from our corpus, and redo the top word exercise. Some of our semantic “personality” is likely reflected in how we use these words, so we lose some of that information when we remove stop words.
data(stop_words) #load lexicon
notes_unnested_ns <- notes_unnested %>%
anti_join(stop_words,by="word") #drop every row whose word appears in the stop word lexicon
n<-10
noter_topw_ns<-notes_unnested_ns %>%
count(noter,word,sort=TRUE) %>%
group_by(noter) %>%
top_n(n) %>%
arrange(noter)
noter_topw_w_ns<-data.frame(matrix(0,nrow=n,ncol=length(unique(noter_topw_ns$noter))))
for (i in 1:length(unique(noter_topw_ns$noter))) {
noter_topw_w_ns[,i]<-head(noter_topw_ns[noter_topw_ns$noter==unique(noter_topw_ns$noter)[i],2],n)
}
names(noter_topw_w_ns)<-unique(noter_topw_ns$noter)
head(noter_topw_w_ns,n=n)
## # A tibble: 10 x 7
## a1 a2 a3 a4 a5 a6 a7
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 causal status bias treatme… treatme… study experime…
## 2 treatment concept causal causal referred treatment field
## 3 selection unionization research control workers bribery category
## 4 slaves research outcome analysis 2 research categori…
## 5 variable data variable albums job unit subjects
## 6 theory effects effect effect analysis causal populati…
## 7 research qual treatment assigned causal baseline effect
## 8 slave ietf selection study control intervent… experime…
## 9 observatio… organization… observation… data model researche… e.g
## 10 variables quality outcomes outcome sample i.e 1
Note: cleaning your analysis corpus “automagically” has some limitations. Words like “cases” and “different” get cleaned out. I’m not sure I agree that these are meaningless “filler” words. But the alternative is to manually remove stop words one by one, and that can get very tedious. So “anti-joining” is an easy option. Anti-joining A and B means removing all elements of \(A \cap B\) from A.
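To make that concrete, a toy sketch of an anti-join (hypothetical tables A and B, just to show the mechanics):
A<-tibble(word=c("cases","different","causal"))
B<-tibble(word=c("cases","different"))
anti_join(A,B,by="word") #returns only "causal": rows of A whose word also appears in B are dropped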
Let’s also check what the stop word removal did to the size of our corpus.
#wordcount
dim(notes_unnested_ns)[1] #number of rows. For columns, [2]
## [1] 26149
#unique words
notes_words_ns<-notes_unnested_ns %>%
count(word,sort=TRUE)
dim(notes_words_ns)[1]
## [1] 5413
Are the top words of each noter, stripped of stop words, informative of what we are writing “about?” What can we infer about the sort of class the noters are taking? It looks pretty clear at this point that it’s a methods class. However, because we have multiple authors and multiple documents per author, something aggregative at the level of the whole corpus might be more useful.
So now, we turn to LDA topic modeling. I highly recommend that at this point you first go read this. If you’re feeling feisty, try the Wikipedia articles for Distributional Semantics and Latent Dirichlet Allocation.
Ok, all done? Now let’s proceed. I’m going to work with the “no stop word” version of the corpus, so first I need to reconstitute it back into one row per document.
notes_reconst_ns<-notes_unnested_ns %>%
group_by(noteid) %>% #one row per note
mutate(ind=row_number()) %>%
tidyr::spread(key=ind,value=word) # this creates a column for each word in each note
notes_reconst_ns[is.na(notes_reconst_ns)]<-"" #shorter notes leave NA cells; replace them with empty strings
notes_reconst_ns<-tidyr::unite(notes_reconst_ns,notes_ns,-c("noteid","noter"),sep=" ",remove=T) #stitch one-word columns together
print(notes_reconst_ns,n=7) #print first 7 rows
## # A tibble: 47 x 3
## # Groups: noteid [47]
## noteid noter notes_ns
## <chr> <chr> <chr>
## 1 04AB a3 "context free takeaway elements causal mechanisms hard measure r…
## 2 04AGR a2 "overview chapter quantitative qualitative approaches conceptual…
## 3 04AMB a7 "discussion paper offers guidelines challenges comparative analy…
## 4 04JN a6 "article authors propose measure residential segregation measure…
## 5 04JS a5 "threats construct validity issues construct validity occur lack…
## 6 04MK a4 "relationship condition related connection association connectio…
## 7 04YJL a1 "introduction conscious thinker person aware assumptions implica…
## # … with 40 more rows
Now, to create a document term matrix. This is a sparse matrix (mostly zeroes) with one row per document and one column per “term.” At its most basic, a term is a word. For our analysis, I’ll be using both words and bigrams (two-word combinations).
To illustrate, the phrase “the quick brown fox” consists of the terms the + quick + brown + fox + the quick + quick brown + brown fox. Note that we have stripped our text of stop words, so some of these bigrams may not accurately reflect actual usage: two surviving words get joined even if a stop word originally sat between them.
notes_ns_dtm <- CreateDtm(notes_reconst_ns$notes_ns,
doc_names = notes_reconst_ns$noteid,
ngram_window = c(1, 2))
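A quick sanity check on the result (a sketch; textmineR joins the words of a bigram with an underscore):
dim(notes_ns_dtm) #rows = documents, columns = terms
head(colnames(notes_ns_dtm)) #terms include both unigrams and underscore-joined bigrams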
Now, for an important step. LDA operates under the assumption that any corpus of documents can be described as a mixture of some pre-determined number of topics or themes, so each document can be described using a vector of topic probabilities. However, the length of that vector, the “correct” number of topics k, needs to be pre-specified. There are many ways of coming up with some “best” number of topics.
I choose to optimize for semantic coherence, a measure of how distinct the meaning of one topic or theme is from the other topics present in the corpus. Partly because this makes sense as something to optimize for topics, and partly because the developer of the textmineR package has spent a lot of time thinking about coherence and is very thoughtful in its implementation. We are looking for the number of topics that maximizes the average coherence of topics, so we need to search through some range of parameter space.
So where do we start? One way to do it, when you’re new to a type of corpus, is to start low, around 5 topics, and cast a wide net, all the way up to around 50 topics. Over time, you get a feel for the range of topics in a particular type of corpus, and then you can start with a narrower range. This is important, because the search process can be very time consuming for large corpora. You’re basically estimating an LDA topic model of your text with 5 topics, then 6 topics, then 7, and so on, and for large corpora, each estimation may take a lot of time. Here, I start at 40 topics because I know based on prior runs that sub-40 topics models have low coherence with this dataset.
coh<-data.frame(k=seq(40,60,by=1),coh=0) #a data frame to store the results of our search
for (i in 1:dim(coh)[1]) {
set.seed(12345)
notes_ns_lda<-FitLdaModel(dtm = notes_ns_dtm, k = coh[i,1], iterations = 500) #fit an LDA model with k topics
coh[i,2]<-mean(notes_ns_lda$coherence) #record the average topic coherence for this k
rm(notes_ns_lda)
}
plot(coh$k,coh$coh,type="l",xlab="Number of Topics",ylab="Average Coherence of Topics")
Looks like we have a winner. Now let’s look at a summary of the topics in our corpus.
k<-coh[coh$coh==max(coh$coh),1] #the k with the highest average coherence
set.seed(12345)
notes_ns_lda<-FitLdaModel(dtm = notes_ns_dtm, k = k, iterations = 5000) # more iterations for the "final" run
notes_ns_lda_sum<-SummarizeTopics(notes_ns_lda)
kable(notes_ns_lda_sum)
| topic | label_1 | prevalence | coherence | top_terms_phi | top_terms_gamma |
|---|---|---|---|---|---|
| t_1 | status_author | 1.50 | 0.297 | status, author, ietf, publication, al | ietf, publication, chair, signals, id |
| t_2 | treatment_control | 1.67 | 0.108 | cluster, means, variance, standard, treatment_control | cluster, clusters, fisher, variances, standard_error |
| t_3 | residential_segregation | 2.39 | 0.441 | segregation, measure, validity, residential, census | segregation, residential, households, residential_segregation, spatial |
| t_4 | grounded_theory | 3.16 | 0.210 | theory, theoretical, grounded, grounded_theory, strauss | grounded, strauss, grounded_theory, glaser_strauss, glaser |
| t_5 | concept_definition | 1.96 | 0.500 | concept, qual, indicators, quant, measurement | qual, indicators, quant, concept_definition, concept |
| t_6 | causal_inference | 2.26 | 0.235 | inference, causal_inference, causal, unit, units | causation, joint, exposure, autor, exposure_variables |
| t_7 | qualitative_researchers | 2.96 | 0.110 | researchers, qualitative, sampling, quantitative, studies | quantitative_researchers, unique, sampling_methods, statistically_representative, based_studies |
| t_8 | regression_discontinuity | 0.94 | 0.085 | regression, subjects, study, threshold, compliers | compliers, discontinuity_designs, crossover, units_assigned, fuzzy_regression |
| t_9 | retrospectively_consecrated | 1.04 | 0.970 | album, consecration, consecrated, music, critics | album, consecration, consecrated, music, professional |
| t_10 | treatment_control | 1.77 | 0.057 | treatment, construct, participants, constructs, control | participants, innovation, administrators, assess_presence, combining |
| t_11 | abandon_prematurely | 3.35 | 0.206 | causal, effect, outcome, potential, causal_effect | potential, causal_effect, causal_effects, outcome_treatment, outcomes_treatment |
| t_12 | trial_sample | 1.65 | 0.220 | randomization, sample, rct, treatment, trial | rct, trial, rcts, ate, perfect |
| t_13 | regression_discontinuity | 1.44 | 0.167 | treatment, assigned, control, units, regression_discontinuity | assigned_control, regression_discontinuity, instrumental_variables, assigned_treatment, instrumental |
| t_14 | sign_object | 1.75 | 0.621 | abduction, induction, deduction, observations, theories | abduction, deduction, induction, peirce, abductive |
| t_15 | selective_sampling | 2.22 | 0.187 | sampling, parameter, selective, populations, time | selective, adopters, selective_sampling, populations, beta |
| t_16 | field_experiments | 1.27 | 0.099 | experiments, field, subjects, lab, field_experiments | field_experiments, lab, social_experiments, substitutes, naturally |
| t_17 | construct_validity | 2.18 | 0.220 | construct, validity, construct_validity, constructs, match | construct_validity, match, particulars, prototypical, sampling_particulars |
| t_18 | free_spaces | 0.96 | 0.602 | change, spaces, reformers, advent, bayshore | spaces, reformers, advent, bayshore, defenders |
| t_19 | dependent_variable | 1.69 | 0.178 | variable, dependent, observations, dependent_variable, selection | dependent_variable, explanatory_variables, explanatory, dependent, indeterminate |
| t_20 | sampling_error | 1.14 | 0.397 | sna, analysis, lna, model, sample | sna, lna, mb, mb_sna, se |
| t_21 | target_population | 2.07 | 0.084 | population, sample, person, target, network | network, chain, target_population, seeds, target |
| t_22 | method_agreement | 2.42 | 0.088 | causal, independent, method, methods, variable | agreement, mill, method_agreement, method_difference, variable_independent |
| t_23 | selection_bias | 2.32 | 0.189 | bias, selection, selection_bias, outcome, endogenous_selection | selection_bias, endogenous_selection, bias, successful, offers |
| t_24 | wage_distribution | 1.01 | 0.285 | job, wage, change, time, workers | wage, retooling, job, skill, percentile |
| t_25 | silver_blaze | 2.01 | 0.216 | tests, test, evidence, analysis, hypothesis | process_tracing, tracing, tests, tannenwald, blaze |
| t_26 | slave_trades | 1.92 | 0.979 | slaves, slave, slave_trades, trades, africa | slaves, slave, slave_trades, trades, africa |
| t_27 | free_spaces | 0.89 | 0.481 | programs, residents, interns, oppositional, night | interns, oppositional, residents, programs, relational |
| t_28 | potential_outcome | 2.88 | 0.160 | treatment, individual, control, potential_outcome, population | catholic, yi_yi, constitutive, constitutive_features, effect_hospitalization |
| t_29 | abandon_prematurely | 0.89 | 0.192 | information, task, time, referrals_performance, pieces | referrals_performance, monitoring_treatment, turnover, job_referred, psa |
| t_30 | people_identify | 2.87 | 0.121 | context, people, ethnographic, develop, identify | audiences, fieldwork, financial, ethnographic, discerning |
| t_31 | potential_outcome | 2.43 | 0.036 | variables, researcher, sample, relationship, yi | block, block_door, computing, iv_dv, quasi |
| t_32 | gram_panchayats | 2.06 | 0.645 | women, gram, panchayats, gram_panchayats, reserved | gram, panchayats, gram_panchayats, reserved, council |
| t_33 | regulatory_compliance | 2.04 | 0.300 | compliance, organizational, organizations, research, regulation | compliance, legal, regulator, regulatory, regulatory_compliance |
| t_34 | civil_society | 2.15 | 0.600 | local, schlesinger, voluntary, civic, government | schlesinger, voluntary, civic, civil_society, classic |
| t_35 | healthcare_providers | 1.66 | 0.486 | bribery, baseline, intervention, threat, reputational | bribery, healthcare, healthcare_providers, providers, reputational_threat |
| t_36 | ladder_abstraction | 2.98 | 0.189 | empirical, conceptual, concepts, political, extension | abstraction, ladder_abstraction, universals, extension, relevance_potential |
| t_37 | observational_research | 2.71 | 0.131 | research, observational, experimental, observational_research, relationship | observational_research, radical, skepticism, experimental_research, radical_skepticism |
| t_38 | treatment_assignment | 2.74 | 0.279 | natural, treatment, experiments, natural_experiments, assignment | treatment_assignment, cpos_information, confounders, natural_experiments, causal_statistical |
| t_39 | directed_graphs | 2.38 | 0.156 | variable, causal, graphs, path, directed | conditioning, graph, directed, graphs, path |
| t_40 | abandon_prematurely | 2.38 | 0.099 | causal, process, qualitative, variable, research | causal_process, data_set, events, level_analysis, focus |
| t_41 | smoking_gun | 1.41 | 0.703 | cpos, gun, smoking_gun, observations, smoking | cpos, gun, smoking_gun, smoking, hoop_test |
| t_42 | referred_workers | 1.29 | 0.249 | referred, performance, workers, referrers, referred_workers | referred, referrers, referred_workers, referrals, referred_referred |
| t_43 | field_experiment | 1.00 | 0.291 | performance, privacy, data, organizational, observability | privacy, observability, organizational_learning, paradox, actions |
| t_44 | retrospectively_consecrated | 1.14 | 0.810 | albums, recognition, popular, film, cultural | popular, film, recognition, odds, albums |
| t_45 | surrogate_outcome | 2.13 | 0.351 | surrogate, outcomes, outcome, hba, people | surrogate, hba, bone, surrogate_outcome, clinical |
| t_46 | field_experiment | 1.01 | 0.210 | transparency, line, lines, study, operators | transparency, operators, shifts, visibility, curtain |
| t_47 | perceptions_quality | 2.56 | 0.144 | status, quality, effect, award, boost | perceptions, boost, citations, perceptions_quality, status_quality |
| t_48 | single_unit | 1.89 | 0.115 | study, unit, units, single, research | study_research, single_unit, causal_proposition, france, study_method |
| t_49 | diversity_policies | 2.07 | 0.583 | unionization, diversity, effects, policies, union | unionization, diversity, union, workplaces, policies |
| t_50 | conceptual_stretching | 2.05 | 0.146 | category, categories, attributes, central, conceptual | category, radial, mother, radial_categories, secondary_categories |
| t_51 | abandon_prematurely | 2.33 | 0.178 | treatment, experiment, effect, effects, data | post, conducted, reduce, constraints, dummy |
| t_52 | wage_distribution | 1.00 | 0.495 | plant, wages, workers, jobs, study | plant, jobs, hires, maintenance, wages |
First, the labels. These are machine-generated, high-probability bigrams associated with the topics. It’s easier to think of a topic in terms of a label than as a number (topic 1, topic 2). Think of this as the machine’s inductive open coding. I’m a big fan of recoding the topics in a way that makes sense to you. This may also be an occasion to assess how similar your interpretation of a topic is relative to others on your team.
Some definitions:
Prevalence: \(\Pr(\text{topic} \mid \text{corpus})\), or how “present” your topic is in the corpus. If I were to grab a document at random, how likely would I be to find this topic present? This column sums to 100.
Coherence: How distinct a topic is from all others. Ranges from 0 to 1. Roughly: \(\Pr(\text{top phi terms} \mid \text{topic}) \neq \Pr(\text{top phi terms} \mid \text{other topics})\).
Phi: \(\Pr(\text{term} \mid \text{topic})\). Top phi terms are those most central to the meaning of the topic.
Gamma: \(\Pr(\text{topic} \mid \text{term})\). Top gamma terms are those most exclusive to the topic.
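These quantities live on, or can be derived from, the fitted model object. A sketch, assuming textmineR’s conventions:
dim(notes_ns_lda$phi) #topics x terms: Pr(term | topic)
dim(notes_ns_lda$theta) #documents x topics: Pr(topic | document)
gam<-CalcGamma(phi=notes_ns_lda$phi,theta=notes_ns_lda$theta) #Pr(topic | term), derived from phi and theta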
Now, we probably have a slightly better sense of what type of methods class this is. Some notes about topic distributions:
-Highly coherent topics are often rare topics. When many authors in many notes are writing about the same topic, the natural variation in their writing styles and note contexts makes the topic diffuse. In our corpus, the highly coherent topics are often those specifically addressed in a single paper or chapter we were summarizing.
-High prevalence + low coherence topics are usually “corpus-level concerns” shared by all authors of a longitudinal multi-authored corpus. This being a methods class, these are usually issues of research design.
-Prevalence is related to the volume of text pertaining to a topic, so topics covered in lengthier notes are more likely to have high prevalence.
-Coherence is related to exclusivity of vocabulary. In highly coherent topics, there is overlap between the central (top phi) and exclusive (top gamma) vocabularies.
Now that we know what topics are present in the corpus, let’s take advantage of a most wonderful output of a topic model: the document topic matrix. It is a Very Good Matrix. It has one row per document (note), and each row is a vector of probabilities that each topic in the corpus is present in that document. The vector sums to 1, so it gives you the % presence of each topic in a note. You can do A LOT with this matrix, both qualitatively and quantitatively.
We can use this matrix to look at how topics are related to each other, and how they are distributed over time and authors. This is a way of analyzing discourse.
doc_top<-as.data.frame(notes_ns_lda$theta) # theta is the document topic matrix in the LDA object that we created above.
doc_top$noteid<-rownames(doc_top)
notes_dim<-notes[,c(1,2)]
notes_dim$week<-substr(notes_dim$noteid,1,2)
doc_top<-merge(notes_dim,doc_top) # adding dimensions like noter and week to each document
#To cluster topics on how likely they are mutually present in a note or a reading:
doc_top.mat<-t(as.matrix(doc_top[,4:ncol(doc_top)])) #transpose so each row is one topic's probability vector across documents
doc_top.dist<-JSD(doc_top.mat) #Jensen-Shannon distance between each pair of topic probability vectors
doc_top.hclust<-hclust(as.dist(doc_top.dist),method="ward.D") #hclusts are data objects about the hierarchical relationship between items.
plot(doc_top.hclust,labels=paste(notes_ns_lda_sum$topic,notes_ns_lda_sum$label_1),cex=0.65,main="Topics clustered on co-occurence in document")
Topics that are closer to each other are more likely to be mutually present in the same document. Note that these dendrograms can “rotate” along their vertical axis, like a mobile over a child’s crib.
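If you want discrete groups of topics rather than a dendrogram, you can also cut the tree; a sketch (the choice of 8 clusters here is arbitrary, purely for illustration):
topic_clusters<-cutree(doc_top.hclust,k=8) #assign each topic to one of 8 groups
split(paste(notes_ns_lda_sum$topic,notes_ns_lda_sum$label_1),topic_clusters) #list the topics in each group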
Now let’s look at how the prevalence of these topics is distributed across authors.
doc_top.l <- pivot_longer(doc_top,cols=starts_with("t_"),names_to="topic",values_to="prevalence") #converting the document topic matrix to a "tidy" form. This helps us aggregate the data quickly in various ways.
#annoying bug in textmineR - topics 1 through 9 don't have a leading 0, so they sort out of order
doc_top.l$topic<-sub("^t_(\\d)$","t_0\\1",doc_top.l$topic) #pad single-digit topic numbers to t_01 ... t_09
#now cast back to "wide" for a heatmap
topic_noter<- doc_top.l %>%
group_by(topic,noter) %>%
summarize(mp=mean(prevalence)) %>%
pivot_wider(values_from=mp,names_from=topic)
#preparing data for the Heatmap
topic_noter.m<-as.matrix(t(topic_noter[,2:ncol(topic_noter)]))
rownames(topic_noter.m)<-paste(notes_ns_lda_sum$topic,notes_ns_lda_sum$label_1)
colnames(topic_noter.m)<-topic_noter$noter
colorpalette <- colorRampPalette(brewer.pal(9, "Greens"))(256) # this sets the color range for the heatmap
heatmap(topic_noter.m,scale="col",cexCol=0.8,cexRow=0.7,col=colorpalette)
Note the hierarchical clustering of rows and columns. Noters who are closer to each other have produced more similar notes. Topics that are closer to each other are used similarly by the noters. You can already see a split between topics tied to the subject material of specific readings and general “class-level” concerns.
I wonder what this would look like if we had retained stop words. Would it look different?
Once we have a few more weeks of data, we can look at time effects. How does the unfolding of the semester affect things like the length of notes or the emphasis on particular topics? And can we identify the professor’s intent in assigning a particular set of readings to a given week?
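A sketch of one such cut, reusing doc_top.l (which already carries the week identifier):
topic_week<-doc_top.l %>%
group_by(week,topic) %>%
summarize(mp=mean(prevalence),.groups="drop") #average prevalence of each topic, by week of class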
To be continued…