This is the capstone research project for this class, blending statistical analysis and communication. The UpShot is widely read by academics and non-academics. Each post is generally well-written, has an interesting question and a key point, and gets to the point fast. It features statistical analysis and visualizations as needed to drive the key point home and address likely reader questions. A non-specialist could read an UpShot post and come away smarter. Whatever you do later on, it’s (probably) the kind of output you’ll need or want to produce to convey quantitative insights to a broad audience. (Recommended time: similar to what you would invest in a final research paper or project.)

I recommend starting the blog post early.

Effort tiers

The effort tiers here are cumulative, i.e. to get a “complete” you must satisfy the “basic” and “complete” requirements. This assignment is inspired by Professor Wolcott, and many of the criteria below are borrowed from her rubric. Blog posts at any level must include a bibliography.

  • Basic: At a minimum, a blog post should have a clearly defined and stated purpose/question. It should be obvious to the intelligent layperson reader why they would be investing time in your post. There should be at least one data visualization/graphic. The visualization should satisfy the following criteria:

    • It should be helpful, and aid the reader’s understanding of the article.
    • It should be necessary, and without it the article would be less effective.
    • It should be easy to understand (e.g. cleanly labeled, minimal chartjunk), so that the reader gets the point of the visualization in seconds.

    See “Visualizations” down below for some helpful advice.

  • Complete: To be a more complete representation of your learning from this class, a blog post should include some data analysis. The post should contain some basic data description, robustness checks, a regression with appropriate interpretation/presentation of the results, and a “Statistical appendix” explaining the messy details for the interested reader. The data description should be enough that your reader understands what’s in the data and where it comes from (words are fine in the main text). The robustness checks should be enough to answer some questions the reader may have about the generalizability/quality of the finding(s) (one or two in the main text is enough). If you make any causal claims, you should explain the necessary assumptions clearly in words; the reader should be able to understand these assumptions and when they might be violated. The results should be presented in a way that a reader who hasn’t taken ECON 210 can follow along (e.g. don’t just drop a table of R/Stata output in there, explain in words and maybe make a picture).

    • The main text of the post should not be too heavy on the statistical analysis. Weave it in with the narrative, and don’t bore your reader.
    • In the process of writing the post, you will likely find that you have more data description, robustness checks, or theory than would be appropriate for an UpShot-style blog post. This is what the statistical appendix is for. It’s a good place to show tables of summary statistics, details on how you merged data (if you merged anything), additional robustness checks (if any), and causal diagrams (if you’re making causal claims).
  • Extensive: To receive an “extensive” appraisal, you must (1) incorporate evidence outside the data, (2) write well, (3) do the analysis in both R and Stata, and (4) provide a replication package for your results. (1) involves choosing an appropriate set of peer-reviewed papers and seamlessly integrating them into the arguments, summarizing key points or issues in the sources cited to critically analyze those ideas and relate them to the post’s purpose. (2) involves controlling pace, rhythm, and variety; words chosen should be apt and precise; sentences should flow smoothly together and clearly open, develop, and close topics. Use the active voice. I recognize that “writing well” is a subjective thing. I encourage you to go to the Writing Center for help with writing. On (4): A replication package is a folder (a OneDrive folder) that contains:

    • All the scripts (.r files, .do files, .Rmd files, .stmd files) you used for your analysis, with comments inside the code chunks (using #) explaining what each line/block of code does. You do not need to replicate your data cleaning and merging code. For example, if you clean and merge your data in R, you do NOT need to duplicate this in Stata. Only the analysis that appears in the blog post (figures, graphs, tables) needs to be replicated in both R and Stata.
    • A readme file (a text/markdown file named README.txt or README.md) explaining how the scripts should be run and in what order, and what datasets are necessary.
    • All the datasets you used.

Someone who knows R and Stata but doesn’t really know what you did in your project should be able to take your replication package, follow the instructions in your readme, and replicate every result/figure you used in the blog post and statistical appendix without modifying the code scripts at all (it should be a turnkey experience; push the buttons to run the scripts and they just work). Replication packages are an important part of open science, and are increasingly a requirement for publication in academic journals and prestigious non-academic outlets.

Generate a link to this replication package folder and include it in the statistical appendix of your blog post.

Rubric

Feedback

You are welcome to chat with me (office hours, schedule an appointment, Slack) about your post.

The blog post is an assignment whose evaluation involves more subjectivity. To adhere to the principles of labor-based grading, I will allow multiple submissions of the blog post. The idea is that you can get feedback on which effort tier the post currently meets, and then choose whether to put in more time and effort to raise it another tier.

To do this, just submit your post on Canvas and send me a DM on Slack. Allow about 72 hours for a response. If there is a surge of submissions in the final weeks, I may have to extend the response time (all the more reason to start early!).

Note that there is a final deadline for the posts. This date is on Canvas. After this date, no resubmissions will be possible.

Examples

Aim for the style of this NY Times post, with the caveat that you’re reporting on your own analysis, not someone else’s.

How to get started

What are you interested in?

Start writing down a few topics/subjects that you have interest in. Read about them. Keep a notepad with possible research questions.

Write down a research question

Typically, a good research question to ask is specific and has an independent variable (X) and a dependent variable (Y).

Here are some examples:

A: What is the effect of exercise on mental health?

B: What is the effect of legalization of recreational marijuana on drug overdoses?

C: Is there racial inequality in criminal sentencing?

For A and B, you can write down a causal graph, while C is probably more of a descriptive analysis.

How will you measure your X and Y?

Following the examples A, B, and C from above, here are some ideas:

A: Exercise = X, measured as the number of hours per week; Mental Health = Y, measured using a mental health scale

B: Legalization = X, equal to one if a state has legalized it and zero if not; Drug Overdose = Y, the number of overdoses per 100,000 people

C: Racial Inequality = X, equal to one if a defendant is a person of color and zero if not; Sentencing = Y, the number of months a person is sentenced

What is your unit of observation?

This is really important and will help you in your search for data. Remember, the unit of observation tells you what each row is in your dataset. Following the examples A, B, and C from above:

A: Individual-level data

B: State-level

C: Individual or County level
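
To make the link between measurement and unit of observation concrete, here is a minimal R sketch of how example B might look once the data are in hand. Everything in it is an assumption for illustration: the state names, the counts, and the variable names (legalized, overdoses, population, x_legal, y_od_rate) are invented, not taken from any real dataset.

```r
# Toy state-level data. Every value below is made up for illustration only.
states <- data.frame(
  state      = c("State A", "State B", "State C"),
  legalized  = c(TRUE, FALSE, TRUE),      # has the state legalized recreational marijuana?
  overdoses  = c(150, 400, 90),           # hypothetical count of overdoses
  population = c(1000000, 2000000, 500000)
)

# X: equal to one if the state has legalized, zero if not
states$x_legal <- as.integer(states$legalized)

# Y: overdoses per 100,000 people
states$y_od_rate <- states$overdoses / states$population * 100000

# Each row is one state, so the unit of observation is the state
print(states[, c("state", "x_legal", "y_od_rate")])
```

If your real data had one row per state per year instead, the unit of observation would be the state-year, and the same two columns would be built the same way.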

Get data

There are a lot of different ways of doing this. To be honest, I usually do a first pass on Google. Here are some additional ideas:

  1. Use Social Explorer or IPUMS
  2. Google has a dataset search tool
  3. Data is Plural has a spreadsheet with links
  4. Ryan Clement (The most awesome data librarian ever!) has a site with collected links

After you have done some preliminary searches and aren’t finding what you want, you can schedule an appointment with Ryan Clement directly.

To make this appointment as helpful as possible, you should let Ryan know:

  • What type of data you are looking for
  • What is the X and Y
  • What is the unit of observation
  • Where have you looked already

Ryan is amazing and thus popular, especially late in the semester when everyone is working on a final project. You will probably increase your chances of meeting with Ryan if you plan accordingly.

Milestones

It is very easy to procrastinate. I will have milestones in the form of periodic quizzes to check in on your progress.

  • What is your topic and question?
  • What is the X and Y? What is the unit of observation?
  • Do you have the data?

Visualizations

Professor Rao, Bea Lea, and I recommend watching both videos. Doing so should put you in a good position to create a strong data visual.

A brief introduction to data visualization.

Advice on using ggplot to create strong visualizations

FAQ

How should the blog post be produced?

A knitted HTML file. Why? I would like to be able to create a portfolio of posts. Note that I will ask for permission before including any posts in a portfolio. This collection may be presented to the college community as examples of the amazing work that students are doing.

How should I include the Statistical Appendix? Separately, or in the same file as the main text?

Include it with the main text. Your final post should be a single file with the main text, figures, bibliography, and any appendices.

What should I do with my code chunks?

Your code and warning messages should not be visible in your final HTML post. Typically you have two types of code chunks:

- Chunks that manipulate the data (e.g. load, mutate, join, merge, etc.). You don’t want to see the code or messages. Simply add include = FALSE to the chunk options.

- Chunks that generate a result (e.g. a ggplot figure). You don’t want to see the code or messages, but you want to see the result. Add the following to the chunk options: echo=FALSE, message=FALSE, warning=FALSE

See this post for more details.

An example markdown file and html output using Problem Set 9 data.
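
As a rough illustration of the two chunk types described above, here is a minimal R Markdown sketch. The chunk labels (load-data, weight-plot) and the variable names are hypothetical, mtcars is a dataset that ships with R and is used only as a stand-in for your own data, and the figure uses base R plot() rather than ggplot so that the sketch needs no extra packages.

````markdown
```{r load-data, include=FALSE}
# Data-manipulation chunk: the code runs, but neither the code
# nor any output or messages appear in the knitted post.
cars <- mtcars
cars$heavy <- cars$wt > 3
```

```{r weight-plot, echo=FALSE, message=FALSE, warning=FALSE}
# Result chunk: the figure appears in the post, while the code,
# messages, and warnings stay hidden.
plot(cars$wt, cars$mpg,
     xlab = "Weight (1,000 lbs)", ylab = "Miles per gallon")
```
````

If you would rather not repeat the options on every result chunk, one alternative is a setup chunk that calls knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE), which sets those defaults for the whole document.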

Think of your post in layers. The main text is for a general audience who cares a lot about your question and findings, but not so much about your methods and the details. The statistical appendix is for a more specialized audience who cares a lot about your methods and the details as well, but doesn’t want to see your code. The replication package is for a still-more specialized audience who wants to have the data and recreate your analysis themselves. Only the final layer of readers cares about your code, and even they don’t want to see it woven into your text.

How much should I describe my data?

At the Basic tier, you should describe your data well enough that a reader who is not steeped in your question understands what the variables are, how they were collected/constructed, what different values mean, and some of the issues which might come up in applying the data to your question. You should tell the reader how many observations you have, what time periods/regions they cover, what the unit of observation is, and what the units of measurement are. You can express this in 1-2 paragraphs.

At the Complete tier, you should include summary statistics of the variables in your statistical appendix. If many observations are missing, you should describe briefly whether this is likely to be a problem for your analysis or not. If you had to construct measures for your study, you should describe how you constructed them and what assumptions your procedure entailed. You can express all this in a table and 1-3 paragraphs (though if you need more space, you may use it).

How long should the post be?

Your main text should be no more than a 15-20 minute read, including time spent looking at figures. The average adult (apparently) reads at 225-250 words per minute. Suppose you include 2 figures, and each takes (no more than) 3 minutes to interpret and digest. At the 20-minute limit, that leaves 14 minutes of reading, which at 225 words per minute gives you a total budget of 3150 words, maximum.

Do not make it too long. I would recommend targeting 15 minutes for the main text, which by the same arithmetic (9 minutes of reading at 225 words per minute, after 2 figures) gives you a tighter budget of around 2025 words. You can always put more detail in your appendix and refer to it (e.g. “This is robust to alternative definitions of my outcome variable; see Appendix for details.”).
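
For anyone who wants to adapt the budget to a different number of figures, the two numbers above can be reproduced as follows. This is just a worked sketch, and the symbols are labels introduced here: W is the word budget, T the total reading time in minutes, F the number of figures, m the minutes per figure, and r the reading speed in words per minute (taken as 225, the slower end of the range).

```latex
\begin{align*}
W &= (T - F \cdot m) \cdot r \\
W_{20} &= (20 - 2 \cdot 3) \cdot 225 = 14 \cdot 225 = 3150 \\
W_{15} &= (15 - 2 \cdot 3) \cdot 225 = 9 \cdot 225 = 2025
\end{align*}
```

Each additional figure at 3 minutes costs 675 words of budget, so a third figure in a 15-minute post would leave about 1350 words.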

How should the post be formatted?

You have some creative leeway here, but please make sure the flow is readable and clear. You don’t need to have an “Introduction” section header, but using section headers is a good idea. Use paragraphs. Your title should be informative. Your main question should be stated clearly within the first couple hundred words (I should not have to wait more than a minute to learn what your question is).

Are we expected to use R for creating the graphs or would it be okay to use Google Sheets or Excel?

Yes, I expect you to use R or Stata for graphs in the blog post. A big part of the class is learning to use statistical programming software to produce high-quality analysis and graphics; you should incorporate that learning into your final project.