As students and activists, we refuse to trust authority. Our society is filled with crooks, liars, thieves, politicians, cops, and bourgeois bores–all of whom consistently lie and often spread lies even when they think they're telling the truth. However, as the social web of lies becomes increasingly vast and hard to navigate, the ability to quickly and easily analyze huge sets of public data is growing just as fast. For this reason, a nimble and critical capacity for do-it-yourself statistical data analysis is absolutely necessary for self-defense in this social war of often aggressive and malicious deceptions. Luckily, in addition to the availability of free and massive data, free, open-source and professional-quality statistical software is now readily available to anyone with an internet connection.
Long before the internet age of info abundance, and especially since around the middle of the twentieth century, people have been earnestly collecting and compiling millions of cold, hard data points. These data typically come in the form of big spreadsheets that record all kinds of interesting things about what people have thought and how they've acted all over the world. The problem is that very few people in the world have had the education (and time and energy) to analyze these datasets independently. This is one of the reasons why so many people, including the mainstream media, can get away with spreading all kinds of the most absurd lies. It's because, traditionally, the knowledge to independently analyze data has been restricted to elites: advertisers, the military, professors, and at best, some PhD students such as myself.
However, over the past few years, it's become really cheap, quick, and easy to obtain and analyze all kinds of big datasets. But still, for some reason, only elites ever use the damn things! In part this is because statistical data analysis is made out to be this big, hard, scary science that you have to study for years before you can independently analyze the social world.
Well, that's all a bunch of bullshit. What follows is a step-by-step walkthrough of how to use statistical thinking (and the free, extremely powerful statistics program called “R”) to analyze hundreds of different questions about American society since 1972.
To illustrate how we use quantitative methods and statistical thinking to test theories, let's use a very general research interest as an example. Let's say we've been thinking and reading a lot about race in the United States. Some people say that the United States is a “post-race” society and that, after the civil rights movement, white people and people of color are all equal. Others say that our society is still deeply racist and that our social institutions still oppress people of color more than white people.
So in very general terms, we could formulate our research question as follows: “Is the United States a racist society?” Then we can just go into the data and see if there is evidence that the United States is a racist society! Great, except, what the hell is racism, exactly? What does it look like? Datasets only give us recorded observations–who did what or who has what, when and where? To investigate racism using data, we have to be able to say what it looks like before we go into the data! Based on what we think racism looks like, we outline a strategy for exploring the data–our own little battle map!
We draw this battle map by outlining hypotheses. Hypotheses are short and specific statements about the connection we should find between certain variables in the dataset, if racism does exist in our society. We might think that two different social groups have different levels of something, or that groups with more of one thing also have more of some other thing. We might think there is no difference between two groups, and so on. Your hypotheses can be about any kind of connection you can think of, because we're going to learn a bunch of different techniques designed for almost any kind of connection you want to investigate.
Two dirty little secrets about developing a research strategy.
1. No statistician, scientist, professor, or politician can tell you how to define anything. In rhetoric and logic, there's something we call “the right to define.” Definitions can be debated and criticized of course, but something to remember is that there's no statistical test for what counts as a true definition of a research interest or theoretical concept! You can state the definition of some social concept such as racism however the hell you want, and as long as you define clearly how you know it when you see it, you can analyze it scientifically.
2. When thinking about what some social phenomenon looks like, you have to look at what's available in the datasets! Sometimes it's good to think freely about exactly what you think some phenomenon (such as racism) looks like, and then focus your research strategy based on what kinds of things the dataset actually records! But the best and most useful practice is to draw your research battle map while browsing all the variables in whatever dataset you're using.
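Browsing the variables in a dataset can take just a few lines of R. Here's a minimal sketch, assuming your data is already loaded into a data frame. The tiny `survey` data frame below, its column names, and its values are all made-up stand-ins for illustration, not the real dataset or its codebook.

```r
# Made-up stand-in for a real survey dataset.
# Column names and values are hypothetical, for illustration only.
survey <- data.frame(
  year  = c(1972, 1972, 1980, 1980),
  race  = factor(c("white", "black", "white", "black")),
  educ  = c(12, 12, 16, 16),
  voted = c(TRUE, FALSE, FALSE, TRUE)
)

names(survey)       # every variable name in the dataset
str(survey)         # each variable's type, plus a peek at its values
summary(survey)     # a quick summary of every variable
table(survey$race)  # how many respondents fall into each category

# Search the variable names for a keyword while drawing your battle map.
matches <- grep("educ", names(survey), value = TRUE)
print(matches)
```

With a real dataset you'd swap your own data frame in for `survey`, and keep the codebook open next to it to see what each variable actually measures.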
To put this into practice, let's return to our example research program. First, we'll use what we already know (from what we see in the streets, from what we feel, from books, from the news, from scientific journals) to develop a definition of racism. After setting out what racism is, or while stating what it is, we have to emphasize what it looks like. Finally, after we state what it looks like, then we go into the data to see if it's really there or not!
What is racism?
What does racism look like, or how would we know it if we saw it?
If racism really exists in the United States, I hypothesize that we'll find certain specific patterns or connections in the data. (Now I definitely have the codebook of variables open, because these are the options available to me right now!)
H1. In the United States, the average socioeconomic status of white people is greater than the average socioeconomic status of black people.
H2. In the United States, white people have more education than black people.
H3. In the United States, white people vote more than black people.*
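To give a rough idea of where hypotheses like these are headed, here is a minimal sketch of comparing the average of some variable between two groups in R. The helper `compare_groups`, the `toy` data frame, its column names, and its numbers are all hypothetical and made up for illustration; they are not real survey data and say nothing about the actual results. Welch's t-test (R's built-in `t.test`) is just one reasonable choice for comparing two group means, not the only one.

```r
# Compare the mean of `outcome` between the two groups named in `group`.
# `compare_groups` and its arguments are hypothetical names for this sketch.
compare_groups <- function(data, outcome, group, alternative = "two.sided") {
  y <- data[[outcome]]
  g <- factor(data[[group]])  # must have exactly two levels
  stopifnot(nlevels(g) == 2)

  # Average of the outcome within each group, ignoring missing values.
  means <- tapply(y, g, mean, na.rm = TRUE)

  # Welch two-sample t-test; "greater" asks whether the first level's
  # mean is larger than the second level's mean.
  test <- t.test(y ~ g, alternative = alternative)

  list(means = means, p_value = test$p.value)
}

# Made-up toy numbers, NOT real data: two groups labeled "a" and "b".
toy <- data.frame(
  grp  = c("a", "a", "a", "a", "b", "b", "b", "b"),
  educ = c(10, 12, 14, 16, 11, 13, 15, 17)
)

result <- compare_groups(toy, "educ", "grp")
print(result$means)    # the average of educ within each group
print(result$p_value)  # how surprising the gap would be if the groups were the same
```

For a hypothesis like H2, you'd pass your real data frame, the name of the education variable, and the name of the race variable (limited to the two groups being compared). For a yes/no variable like voting in H3, a comparison of proportions (for example with R's `prop.test`) would probably be a better fit than a t-test.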
Introduction
Literature Review
Data and Method
Analysis and Discussion
Conclusion