If submission: full credit + replace an earlier assignment grade
A reminder to use GitHub commits
Presentations!
A reminder to engage / ask your peers questions! (part of rubric now)
Remember the “families” of text analysis
Term frequencies
Document structure
Semantic similarity
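As a reminder of the first family, here is a minimal sketch of term frequencies using only the Python standard library. The two example sentences and the function name `term_frequencies` are made up for illustration; the tokenizer is deliberately crude.

```python
from collections import Counter
import re


def term_frequencies(text):
    """Return a Counter of lowercase word tokens in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# Hypothetical toy documents, not course data.
docs = [
    "The march began in January and ended in March.",
    "The soldiers march at dawn.",
]
tf = [term_frequencies(d) for d in docs]
for counts in tf:
    print(counts.most_common(3))
```

Note that raw counts treat both uses of "march" as the same term, which is part of why the other two families exist.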
Document Structure
The second family, document structure analysis, assumes one can extract from word co-occurrence statistics what any given document is “about” (i.e., what the appropriate keywords or themes are) and represents text as observations that vary on this feature.
Document Structure
Each document has “themes” or “topics”
Words combined with each other can comprise topics (e.g. “march” with “january” vs. “soldier”)
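The "march" example can be sketched by counting which words appear in the same document. This is only an illustration of co-occurrence counting, not a topic model; the four toy documents and the function name `cooccurrence_counts` are assumptions.

```python
from collections import Counter
from itertools import combinations
import re


def cooccurrence_counts(docs):
    """Count how many documents each unordered word pair appears in together."""
    pair_counts = Counter()
    for doc in docs:
        # Sort the unique words so each pair is stored in alphabetical order.
        words = sorted(set(re.findall(r"[a-z]+", doc.lower())))
        pair_counts.update(combinations(words, 2))
    return pair_counts


# Hypothetical toy corpus: two "calendar" documents, two "military" documents.
toy_docs = [
    "march january february calendar",
    "january march snow calendar",
    "soldier march parade drum",
    "soldier drum march army",
]
pairs = cooccurrence_counts(toy_docs)
print(pairs[("january", "march")])
print(pairs[("march", "soldier")])
print(pairs[("january", "soldier")])
```

In this toy corpus "march" co-occurs with both "january" and "soldier", but "january" and "soldier" never co-occur, which is the kind of pattern that separates two themes.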
Topic Models
Documents are comprised of topics
Topics are comprised of words
Researchers must specify the length of each document and the proportion of each document expected to come from each topic
Algorithm gives us words related to each topic
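One way to picture "documents are made of topics, topics are made of words" is to run the story forward and generate a document. The sketch below assumes two hypothetical topics with made-up word probabilities; a topic-modeling algorithm works in the opposite direction, starting from observed documents and estimating the topics.

```python
import random

# Hypothetical topics: each maps words to probabilities that sum to 1.
topics = {
    "calendar": {"march": 0.4, "january": 0.4, "week": 0.2},
    "military": {"march": 0.3, "soldier": 0.5, "army": 0.2},
}


def generate_document(length, topic_proportions, rng):
    """Draw `length` words: pick a topic for each word, then a word from that topic."""
    names = list(topic_proportions)
    weights = [topic_proportions[name] for name in names]
    words = []
    for _ in range(length):
        topic = rng.choices(names, weights=weights, k=1)[0]
        vocab = topics[topic]
        word = rng.choices(list(vocab), weights=list(vocab.values()), k=1)[0]
        words.append(word)
    return words


rng = random.Random(0)  # fixed seed so the example is repeatable
doc = generate_document(10, {"calendar": 0.7, "military": 0.3}, rng)
print(" ".join(doc))
```

The document length (10) and the 70/30 topic split are the inputs chosen by hand here; the word "march" can come from either topic.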
Latent Dirichlet Allocation
Each document has a distribution \(\theta\) over topics
Each topic has a distribution \(\phi\) over words
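For reference, the standard LDA generative model can be written out as below, reading \(\theta\) as topic proportions and \(\phi\) as word probabilities within a topic. The hyperparameters \(\alpha\) and \(\beta\), the indices \(k\), \(d\), \(n\), and the variables \(z\) and \(w\) are conventional notation added here, not symbols from the slides.

```latex
\begin{align*}
\phi_k &\sim \operatorname{Dirichlet}(\beta) && \text{word distribution for topic } k \\
\theta_d &\sim \operatorname{Dirichlet}(\alpha) && \text{topic proportions for document } d \\
z_{d,n} \mid \theta_d &\sim \operatorname{Categorical}(\theta_d) && \text{topic of the } n\text{th word in document } d \\
w_{d,n} \mid z_{d,n}, \phi &\sim \operatorname{Categorical}(\phi_{z_{d,n}}) && \text{the observed word}
\end{align*}
```

Only the words \(w_{d,n}\) are observed; everything else is latent and must be estimated from the corpus.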
Large Language Models
Large Language Models (LLMs) are neural networks with billions of parameters trained on large amounts of data
Model: a statistical tool that uses data to make predictions
Language: text data, such as words and topics
Large: Lots of data!
Increasingly, the term “LLM” refers to generative models
The Paper Factory
A single LLM can’t write a sociology paper
But a multi-agent workflow can (?)
The Paper Factory
Sociologists Engzell and Wilmers automated the entire process of writing quantitative sociology papers
Their model: a multi-agent workflow
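To make "multi-agent workflow" concrete, here is a minimal sketch of the general pattern: several specialized steps pass a shared draft along a pipeline. The slides do not describe the roles Engzell and Wilmers used, so the four agents below (`planner`, `analyst`, `writer`, `reviewer`) are hypothetical, and each is a stub that returns placeholder text instead of calling an LLM.

```python
from dataclasses import dataclass, field


@dataclass
class Draft:
    """Shared state passed from one agent to the next."""
    question: str
    sections: dict = field(default_factory=dict)
    log: list = field(default_factory=list)


# Hypothetical agents; in a real system each would prompt a language model.
def planner(draft):
    draft.sections["plan"] = f"Hypotheses and data needs for: {draft.question}"
    return draft


def analyst(draft):
    draft.sections["results"] = "Placeholder for model estimates"
    return draft


def writer(draft):
    inputs = ", ".join(sorted(draft.sections))
    draft.sections["text"] = f"Placeholder prose drawing on: {inputs}"
    return draft


def reviewer(draft):
    draft.sections["review"] = "Placeholder critique of the draft text"
    return draft


def run_workflow(question, agents):
    """Run each agent in order and record which agents touched the draft."""
    draft = Draft(question=question)
    for agent in agents:
        draft = agent(draft)
        draft.log.append(agent.__name__)
    return draft


result = run_workflow("Does X predict Y?", [planner, analyst, writer, reviewer])
print(result.log)
```

The point of the pattern is division of labor: no single step has to do the whole paper, and the log shows which step produced what.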
The Paper Factory
In pairs:
What are some of the benefits of automating social science research? The risks/drawbacks?
Benefits of LLMs
Codifying research processes and “hidden curriculum”
Open and reproducible science
Freeing up time for non-trivial tasks
Risks of LLMs
“Mediocrity at scale”
Destruction of peer review
The Future of Surveys?
Silicon Sampling refers to “sampling” LLMs rather than humans
Argyle and colleagues (2023) argue that we can learn much from surveying LLMs (conditioned on human backstories)
“The information contained in GPT-3 goes far beyond surface similarity. It is nuanced, multifaceted, and reflects the complex interplay between ideas, attitudes, and sociocultural context that characterize human attitudes.”
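The mechanics of silicon sampling can be sketched as: write a backstory, append a survey question, ask the model many times, and tally the answers. Everything below is an assumption for illustration. The backstories are invented, and `stub_llm` returns random answers in place of a real model, so the tallies mean nothing substantively.

```python
from collections import Counter
import random

# Hypothetical backstories, not drawn from any real survey.
backstories = [
    "I am a 34-year-old nurse living in a large city.",
    "I am a 61-year-old farmer living in a rural county.",
]


def build_prompt(backstory, question):
    """Combine a first-person backstory with a survey question."""
    return f"{backstory}\nQuestion: {question}\nAnswer:"


def stub_llm(prompt, rng):
    """Stand-in for a language model call; picks an answer at random."""
    return rng.choice(["agree", "disagree"])


def silicon_sample(backstories, question, n_per_backstory, rng):
    """Collect `n_per_backstory` simulated answers for each backstory."""
    tallies = {}
    for backstory in backstories:
        prompt = build_prompt(backstory, question)
        answers = (stub_llm(prompt, rng) for _ in range(n_per_backstory))
        tallies[backstory] = Counter(answers)
    return tallies


rng = random.Random(42)
results = silicon_sample(backstories, "Should taxes be raised?", 20, rng)
for backstory, counts in results.items():
    print(backstory, dict(counts))
```

With a real model in place of the stub, the research question becomes whether the answer distributions for each backstory resemble those of comparable human respondents.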
Illusions of AI?
Messeri and Crockett (2024): “The proliferation of AI tools in science risks introducing a phase of scientific enquiry in which we produce more but understand less.”
Illusion 1: Breadth
Illusion 2: Depth
Illusion 3: Objectivity
Benefits for Society
Reduced administrative burdens?
Differences in valued skills?
Risks for Society
Existential risks
“If somebody builds a too-powerful AI, under present conditions, I expect that every single member of the human species and all biological life on Earth dies shortly thereafter.” - Eliezer Yudkowsky (2023)
Future of Social Science Research
We’ve covered a lot of methods for computational social science research
Data viz
Spatial analysis
Networks
Prediction and algorithmic modeling
Text analysis
So … what is computational social science?
Anything that’s cool?
Future of Social Science Research
In the era of LLMs, the division between social scientists and data scientists may be sharper than ever
Future of Social Science Research
What do you think the future of computational social science will be?