What is the end goal of this analysis?

The core answer is that I need to finish an A-exam, which translates to preparing a publication.

What paper?

I intend the paper to be methods- and theory- heavy, something akin to this Poetics special edition I loved so much.

Most sociological literature is trying to understand something so they can fix it.

I’m trying to understand the development of theory in Sociology. I want to know why we progressed in this way, instead of that way. What constructs sociology? What are the forces guiding it? If different forces would have been applied, what would have happened?

It could be about “intellectual history,” uncovering the “true source” of an idea. But I think it has much more potential than that. A more interesting question is:

Given a sea of sociological writings, how is a canonical, commonly understood language built up?

Is Sociology progressing?

You have to be careful with questions like this. Certainly those doing the work have some idea of what the point is, and Sociology is indeed progressing, haphazardly, toward the ambitions of those who inhabit it.

What is scientific literature accomplishing, and how is it doing that?

What can we do to fix problem X? Here’s the problem, and here’s a way to fix it.

Does sociology ever come to a “consensus,” only for that consensus to shift and eventually be overturned?

Are there any direct contradictions, clear cut disagreements, either over time or in a single time? What does the map of these contradictions look like?

Of course the nitty-gritty question is how I can identify contradictions in the first place. One way is to split sentences into cleaned phrases (parts of sentences), and see if one author says “PART is undeniable” and another says “PART cannot be true”.
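A minimal sketch of what I have in mind, assuming a toy list of assertion and denial markers and invented example sentences; the real marker lists and phrase cleaning would need far more care:

```python
import re
from collections import defaultdict

# Toy epistemic markers; the real lists would be built from the corpus itself.
ASSERT = re.compile(r"(.+?)\s+(?:is undeniable|is clearly true|must be the case)", re.I)
DENY = re.compile(r"(.+?)\s+(?:cannot be true|is clearly false|must be rejected)", re.I)

def normalize(phrase):
    """Crude phrase cleaning: lowercase and strip framing words and punctuation."""
    phrase = phrase.lower().strip(" .,;:")
    return re.sub(r"^(?:the (?:claim|idea) that|that|the|a|an)\s+", "", phrase)

def find_contradictions(sentences_by_author):
    """Return phrases asserted by at least one author and denied by another."""
    asserted, denied = defaultdict(set), defaultdict(set)
    for author, sentences in sentences_by_author.items():
        for sentence in sentences:
            for m in ASSERT.finditer(sentence):
                asserted[normalize(m.group(1))].add(author)
            for m in DENY.finditer(sentence):
                denied[normalize(m.group(1))].add(author)
    return {p: (asserted[p], denied[p]) for p in asserted if p in denied}

# Invented example sentences, for illustration only.
corpus = {
    "Author A": ["That class conflict drives social change is undeniable."],
    "Author B": ["The claim that class conflict drives social change cannot be true."],
}
print(find_contradictions(corpus))
```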

How long do “conversations” go on?

I’ve seen a few which last just a few papers, and others which balloon into global phenomena. What does this diffusion network look like? What ends a conversation? What keeps it going? How has this changed over time, with new global forces in the academic world?

Where to publish?

Ideally, the journal I shoot for will be the one I end up publishing in.

I’ve analyzed recent publications in Sociological Theory and find that the paper I want to write won’t fit well. I moved on to analyze Social Studies of Science, and found that journal even more incompatible with my aims.

Obviously AJS is a good choice, but the likelihood of getting in is quite slim.

Another option is to publish a preprint, with the paper organized however I see fit, in ways that best explain what I did. I could even self-publish. The only downside is that peer review would be missing.


Social Moments seems like a great option.

  • Targeted at grad students, so my competition (and political barriers) are limited
  • Peer reviewed
  • Interdisciplinary
  • Free!
  • “editorially independent”

Theory: specifying a theoretically interesting analysis

What are the good questions to ask?

The Positivist

One way of stating this is asking whether existing theories are true or false. The steps in answering this question are as follows:

  • Find well-formulated theories of the sociology of sociology
  • Are they testable with the data I have? How?
  • Are they important?

The Qualitative Sociologist

In this case our question wouldn’t be “is this true?”; it would be “what is happening?” That is, what generalizations can I make?

These descriptive, inductive questions are actually secondary to (that is, they come after) the questions “how should I look?” and “what should I look at?”

In this model the analysis proceeds as follows:

  • Specify a multitude of ways to look at the phenomenon, hoping for triangulation
  • Look
  • Carefully make generalizations from what you see

These steps are just a heuristic, and multiple steps can happen at once, but it’s a useful frame.

For the moment I’m going to stick to the latter. I’m not convinced theoretical guidance makes for good Sociology.

Ways to look, and what they might let us see

What are the extremes of the distributions?

Construct metrics of success, conceptual brokerage, and conceptual creation. In these metrics, who are the outliers? How skewed is the distribution?
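As a rough sketch of what this looks like in practice, assuming a hypothetical per-author brokerage score has already been computed, the skewness and outlier check is only a few lines:

```python
import numpy as np
from scipy.stats import skew

def describe_extremes(scores, z_cut=3.0):
    """Summarize how skewed a metric is and which observations are extreme."""
    scores = np.asarray(scores, dtype=float)
    z = (scores - scores.mean()) / scores.std(ddof=1)
    return {
        "skewness": skew(scores),                        # > 0 means a long right tail
        "outlier_indices": np.flatnonzero(np.abs(z) > z_cut),
    }

# Illustrative heavy-tailed metric standing in for a real brokerage score.
rng = np.random.default_rng(0)
brokerage = rng.lognormal(mean=0.0, sigma=1.0, size=500)
print(describe_extremes(brokerage))
```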

Clustering concept trajectories over time and (social) space

The input to this method is the set of concept-change trends over time generated above. A distance is calculated between each pair of trends, and a cluster analysis groups those that are similar. A person then examines these groupings, looking for patterns among the patterns and applying some interpretation.

Where, when, and for whom a certain “meaning shift” happened can then be investigated. The researcher connects these trends to the development of meaning in some concrete case, showing that the trends can be used as indicators for X.
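A minimal sketch of the clustering step, assuming each concept has already been reduced to a fixed-length trend of per-slice change scores; the correlation distance and average linkage are illustrative defaults, not commitments:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_trends(trends, n_clusters=4):
    """trends: dict mapping concept -> 1-D array of per-slice change scores."""
    names = sorted(trends)
    matrix = np.vstack([trends[name] for name in names])
    distances = pdist(matrix, metric="correlation")  # pairwise distance between trends
    tree = linkage(distances, method="average")
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    return dict(zip(names, labels))

# Toy example: two concepts that drift together, one that moves on its own.
example = {
    "habitus":         np.array([0.10, 0.20, 0.40, 0.80, 0.90]),
    "field":           np.array([0.10, 0.25, 0.45, 0.75, 0.95]),
    "rational choice": np.array([0.90, 0.70, 0.40, 0.20, 0.10]),
}
print(cluster_trends(example, n_clusters=2))
```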

Questions, and how to answer them

Where do authors publish?

Do they stick to the same journal or are they more varied? Over time how does this propensity evolve? Do authors “settle in” to places of publication?
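One hedged way to operationalize “settling in,” assuming the publication metadata sits in a table with author, journal, and year columns (the column names are placeholders): compare the entropy of an author’s journal choices early versus late in their career.

```python
import numpy as np
import pandas as pd

def journal_entropy(journals):
    """Shannon entropy (bits) of an author's journal choices."""
    counts = journals.value_counts(normalize=True)
    return float(-(counts * np.log2(counts)).sum())

def settling_in(pubs, split_years=10):
    """pubs: DataFrame with columns author, journal, year."""
    first_year = pubs.groupby("author")["year"].transform("min")
    pubs = pubs.assign(
        career_stage=np.where(pubs["year"] - first_year < split_years, "early", "late")
    )
    return (pubs.groupby(["author", "career_stage"])["journal"]
                .apply(journal_entropy)
                .unstack("career_stage"))

# An author who "settles in" shows lower entropy in the late column.
pubs = pd.DataFrame({
    "author":  ["A", "A", "A", "A"],
    "journal": ["AJS", "ASR", "AJS", "AJS"],
    "year":    [1970, 1975, 1985, 1990],
})
print(settling_in(pubs))
```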

Methods: creating derivative datasets for analysis

Cleaned body of the article

  • Headers, footers, and footnotes are stripped from the document
  • The “top-matter”, the abstract, and references of a paper are removed.
  • Tables and other non-textual information are removed.
  • Pages are bound together, repairing hyphenation broken across page breaks
  • The document is split by sentence
  • The document is word-tokenized
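A hedged sketch of the last few steps of this pipeline (page joining, de-hyphenation, sentence splitting, word tokenization), assuming NLTK with its punkt models installed; header, footer, table, and reference stripping happen upstream and aren’t shown:

```python
import re
import nltk

def join_pages(pages):
    """Concatenate page texts, merging words hyphenated across a page break."""
    text = ""
    for page in pages:
        page = page.strip()
        if text.endswith("-"):
            text = text[:-1] + page          # re-join the broken word
        else:
            text = (text + " " + page).strip()
    # Also repair in-page hyphenation at line breaks.
    return re.sub(r"-\s*\n\s*", "", text)

def tokenize_body(pages):
    body = join_pages(pages)
    sentences = nltk.sent_tokenize(body)
    return [nltk.word_tokenize(s) for s in sentences]

# Invented two-page snippet with a word broken across the page boundary.
pages = ["Social theory has long strug-", "gled with the micro-macro link. It persists."]
print(tokenize_body(pages))
```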

Citation network

Author metadata

Additional metadata about an author is taken from their CV. We pay MTurkers to go and find the CV for the person. If an MTurker doesn’t find a CV, we ask a second MTurker. If they can’t find it either, we consider the data missing.

Once the PDF is received, it is automatically tagged for sections by a trained tagger.

Metadata is then parsed automatically using an Earley parser. Institutions are detected, along with their context. Names are detected, along with their context.

Textual and institutional context of each mention of central sociological terms and figures

Central terms of interest are taken from a glossary in a commonly used and modern Sociology textbook. Terms are marked with their part of speech and various alternative surface forms. Each sentence of each document is checked for each term. On a match, the exact position and document are recorded, as well as the full sentence in which the term appears.
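A minimal sketch of this mention-recording step, with a two-entry stand-in for the textbook glossary and a made-up document ID:

```python
import re

# Illustrative stand-in for the glossary: canonical term -> surface forms.
GLOSSARY = {
    "anomie":         ["anomie", "anomic"],
    "social capital": ["social capital"],
}

PATTERNS = {
    term: re.compile(r"\b(" + "|".join(map(re.escape, forms)) + r")\b", re.I)
    for term, forms in GLOSSARY.items()
}

def find_mentions(doc_id, sentences):
    """Return one record per match, with position and the full sentence."""
    records = []
    for sent_idx, sentence in enumerate(sentences):
        for term, pattern in PATTERNS.items():
            for m in pattern.finditer(sentence):
                records.append({
                    "doc": doc_id,
                    "term": term,
                    "surface": m.group(1),
                    "sentence_index": sent_idx,
                    "char_offset": m.start(),
                    "sentence": sentence,
                })
    return records

# The document ID here is hypothetical.
print(find_mentions("doc-0412",
                    ["Durkheim's anomic division of labor is revisited here."]))
```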

The famous-figures search might be a bit more complex, as name variants could fail to match exactly.

The name and institution are extracted from the body of the document itself. Using the author’s current CV (see Author metadata), we create institutional variables for their involvement at the time they published the piece.

Because there is some uncertainty in where they were when they wrote it, we might want to take this into account when making conclusions about their attributes.

Relevant terms appearing in each article

This uses a ground-up approach to extracting useful or important terms. These can be nouns or compound nouns, adjectives, or relations between statements. There is some literature on each of these, and what makes them “interesting” is something which can be rigorously specified.
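One hedged way to get candidate terms from the ground up: part-of-speech tagging plus a simple noun-phrase chunk grammar (NLTK, which needs its tagger and tokenizer models installed). The “interestingness” criteria would be layered on top of this; relations between statements are not handled here.

```python
import nltk

# Optional adjectives followed by one or more nouns, as a first pass at candidate terms.
chunker = nltk.RegexpParser("NP: {<JJ>*<NN.*>+}")

def candidate_terms(sentence):
    """Extract noun-phrase chunks from a sentence as candidate terms."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    tree = chunker.parse(tagged)
    return [" ".join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees()
            if subtree.label() == "NP"]

print(candidate_terms("Symbolic interactionism reframes the definition of the situation."))
```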

Clustering of term network change over time and place

Options:

  • An edge can be counted when two terms appear in the same document or the same sentence. Alternatively, edges can be weighted by distance or based on syntactic relations.
  • The aggregation of these edges over documents can be a simple sum or a log frequency.
  • Post-hoc weighting, for example based on tf-idf, could be applied
  • We can tweak what terms are part of the network in the first place, either a priori or after constructing a complete network
  • There are a few options for a distance metric between a term’s egonetworks
  • There are likewise options for the distance metric between time trends

The method uses a window of 5 years for “now,” measuring the difference to the 5-year window starting one year later. The output is a single number for each slice (its distance to the previous slice), which should be nonnegative. This windowed method increases the size of the time series and smooths trends, while providing ample data for accurate construction of the term-term network at each time.
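A sketch of the windowed change score under simple defaults drawn from the options above (sentence-level co-occurrence edges, plain sums, and a Jaccard distance on the edge sets); documents are assumed to already be reduced to (year, list-of-term-sets-per-sentence) records:

```python
from collections import Counter
from itertools import combinations

def window_network(docs, start, width=5):
    """docs: list of (year, [set_of_terms_per_sentence]); returns edge counts."""
    edges = Counter()
    for year, sentences in docs:
        if start <= year < start + width:
            for terms in sentences:
                edges.update(frozenset(p) for p in combinations(sorted(terms), 2))
    return edges

def jaccard_distance(a, b):
    """Distance between the supports of two edge sets (weights ignored here)."""
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b) if (a or b) else 0.0

def change_series(docs, first_year, last_year, width=5):
    """One nonnegative number per slice: distance from the window starting at
    `year` to the window starting at `year + 1`."""
    return {
        year: jaccard_distance(window_network(docs, year, width),
                               window_network(docs, year + 1, width))
        for year in range(first_year, last_year - width)
    }
```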

This magnitude of change can be extracted on different sets of documents, producing trends at different levels of granularity:

  1. Whole dataset
  2. Each journal
  3. By geographic origin of writers
  4. By stated subdisciplines / keywords

A methodological note: This method is extremely computationally intensive. Keeping track of counts of all tuples takes a lot of space and computational resources. This can be avoided to some extent by surveying a small set of representative documents and identifying terms of interest. Constructing a network for this smaller set of concepts is a bit easier.

Methodological note 2: At the moment there is plenty of junk finding its way into my dataset. I successfully removed bibliographies, headers, and footers, but I have yet to extract tables. This doesn’t pose much of a problem, because these “terms” form their own distant cluster.

We then filter the trends, keeping only those which show substantial variation and which have enough presence in the documents. Because we have several levels of granularity, we can exclude terms which don’t change across the dataset, or are constant within the journal of interest. What we’re left with is a rich indicator of semantic change over time for each term.
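A small sketch of this filter, assuming the per-term change scores live in a DataFrame (years as rows, terms as columns) and document frequencies in a Series; the thresholds are placeholders to be tuned:

```python
import pandas as pd

def filter_trends(trends, doc_freq, min_std=0.05, min_docs=50):
    """trends: DataFrame indexed by year, one column per term.
    doc_freq: Series mapping term -> number of documents it appears in."""
    varying = trends.columns[trends.std() >= min_std]      # enough variation
    present = set(doc_freq[doc_freq >= min_docs].index)    # enough presence
    keep = [t for t in varying if t in present]
    return trends[keep]
```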

One possible next step, listed below, is to cluster these trends, looking for changes which co-occur and can be said to constitute the same change in language.

Term co-presence over time and place

The above analysis can be simplified a bit by considering the trends of specific edges, instead of entire egonetworks. The same idea of filtering and clustering to find trends could be applied.

Semantic network

Anything which can be described by an Earley parser (that is, anything expressible as a context-free grammar) can be run efficiently against sentences. This method can’t express certain intuitive syntax, but it is useful for extracting a small set of meaningful phrases. Sentences which can be decomposed are marked up to show which parts are decomposable. This is done for about 1,000 sentences. These forms are then applied to the full dataset, and false positives are coded away until the data looks mostly clean. Each surface form is interpreted as a relation between various sub-statements.

Superfluous add-ons, such as “it is apparent that,” are also specified based on a reading of the sentences.

Sub-statements must then be reduced, i.e., grouped together when they have similar or identical referents. Those statements which have sufficiently many relations are exported to a Neo4j graph, which can then be queried. Each edge of the graph is labeled with the historical and textual context of that link.
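A compact sketch of these last steps under loose assumptions: a couple of regexes stand in for the hand-specified surface forms, “reduction” is just stripping filler and lowercasing, and the Neo4j schema (Statement nodes, RELATES edges carrying the context) is illustrative rather than fixed. The resulting (query, parameters) pairs can be run through the official neo4j driver’s session.run or pasted into the Neo4j browser.

```python
import re

# Illustrative surface forms; the real set is specified from a reading of ~1,000 sentences.
SURFACE_FORMS = [
    (re.compile(r"^(.*?)\s+leads to\s+(.*)$", re.I), "LEADS_TO"),
    (re.compile(r"^(.*?)\s+is a form of\s+(.*)$", re.I), "IS_A"),
]
FILLER = re.compile(r"^(it is apparent that|it is clear that)\s+", re.I)

def reduce_statement(text):
    """Crude reduction: strip filler phrases and punctuation, lowercase."""
    return FILLER.sub("", text.strip(" .")).lower()

def extract_relations(doc, year, sentence):
    for pattern, rel in SURFACE_FORMS:
        m = pattern.match(sentence.strip())
        if m:
            yield {"subject": reduce_statement(m.group(1)),
                   "object": reduce_statement(m.group(2)),
                   "relation": rel, "sentence": sentence, "doc": doc, "year": year}

# One Cypher template per relation; node labels and property names are assumptions.
CYPHER = ("MERGE (a:Statement {text: $subj}) "
          "MERGE (b:Statement {text: $obj}) "
          "CREATE (a)-[:RELATES {kind: $rel, sentence: $sent, doc: $doc, year: $year}]->(b)")

def to_cypher(relations):
    return [(CYPHER, {"subj": r["subject"], "obj": r["object"], "rel": r["relation"],
                      "sent": r["sentence"], "doc": r["doc"], "year": r["year"]})
            for r in relations]

# Hypothetical document ID and sentence, for illustration only.
rels = list(extract_relations("doc-001", 1987,
            "It is apparent that status anxiety leads to conspicuous consumption."))
print(to_cypher(rels))
```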