Final Project

For your final project, you will take a dataset, explore it, tinker with it, and tell a nuanced story about it using any method of automated text analysis covered in this class. I want this project to be as useful for you and your future career as possible - you’ll hopefully want to show off your final project in a portfolio or during job interviews.

Accordingly, you have some choice in what data you can use for this project. I’ve found several different high-quality text datasets online.

You do not have to choose one of the datasets listed below; you can also find and use your own text data. Choose whichever dataset you are most interested in or will have the most fun with.

Literary works

Review

Spam

90 Twitter datasets available - data.world: Real-world Twitter data that can be used for text analysis.

Instructions

Write a memo using R Markdown to introduce, frame, and describe your story and figures. Use the final project template to get started. The required contents of the memo are listed under Tasks below.

Due: June 17 by 11:59 pm

Weight: This final project is worth 30% of your final grade.

Purpose:

The skills covered in this course are rooted in analytical practice with text data rather than in formulas and equations. As such, applying these principles to a real data problem is one of the best ways to learn them and to assess your mastery of them. I guarantee that one day you will need to apply these principles to communicate an idea or a story to an audience, so let’s make sure you have at least one chance to practice before the stakes are higher.

Skills & Knowledge:

Your final project should illustrate your ability to transform raw text data into insights by making the unstructured structured, showing clear trends or patterns, and / or identifying information from text data. The specific skills involved in achieving this goal include all of the course learning objectives listed on our E-class page.

Teams:

You should work on this final project individually. You may team up with classmates to help each other, but everyone must finalize and submit their own individual report.

Submission Details:

Use the final report template for your analysis and report. Your final report should be written as a .Rmd file that compiles to an HTML webpage. Publish your compiled page online (e.g. via RPubs, GitHub, etc.), then submit all of your files (including your .Rmd file, data files, image files, etc.) as a single .zip file on E-class by the due date. Also include the URL of the published HTML report page in your E-class submission.
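
As a rough sketch of the compile-and-package step, the helper below renders a report to HTML and bundles the files into one archive. The function name `build_submission` and every file name are hypothetical. It assumes the `rmarkdown` package is installed and that the rendered page is written next to the .Rmd file (the default behaviour). `utils::zip()` also relies on a zip program being available on your system. Publishing to RPubs or GitHub remains a separate step.

```r
# Hypothetical helper: render the report, then zip it with its supporting files.
build_submission <- function(rmd_file, extra_paths = character(),
                             zip_file = "final_project.zip") {
  # Knit the .Rmd file to an HTML page (requires the rmarkdown package).
  rmarkdown::render(rmd_file, output_format = "html_document")

  # By default the HTML file sits next to the .Rmd file with the same base name.
  html_file <- sub("\\.[Rr]md$", ".html", rmd_file)

  # Bundle the source, the rendered page, and any data or image folders.
  utils::zip(zip_file, files = c(rmd_file, html_file, extra_paths))
  invisible(zip_file)
}

# Example call (file and folder names are placeholders):
# build_submission("final_report.Rmd", extra_paths = c("data", "images"))
```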

Assessment:

We will use this rubric to grade your report.

Tasks:

Your final report should be a fully reproducible product that is available online as an HTML webpage. It should include text, data, code, and plots. Below is a list of specific items your report should include (check the rubric to see their relative weighting).

  1. Follow these formatting rules:
  • As the report will compile to an HTML webpage, there is no length requirement; your report should be detailed enough to address the requirements listed below and concise enough to do so in as few words as necessary.
  • In your markdown YAML header, include the project title and the name(s) of the student(s) involved in the project so that they appear at the top of the rendered HTML page.
  • Your report should be fully reproducible: all data formatting and charts should be written in code chunks and rendered when you compile your .Rmd file to an HTML webpage.
  • Your report should be written in a narrative format (i.e. using coherent paragraphs rather than a series of bullet points). You may use headings where appropriate to break up your report into sections.
  • Proofread your HTML webpage before you submit: double-check for spelling and formatting errors, especially in rendered charts and tables!
  2. State your research question and motivate why it is important / why the reader should care.

  3. Describe your data:

  • Download a dataset and explore it. Many of these datasets are large and will not open (well) in Excel, so you’ll need to load the CSV file into R with read_csv(). Your chosen dataset should include texts that you can analyze; many also have additional variables that you can use for grouping and summarizing. Your past lecture scripts and homework assignments will come in handy here.
  • Articulate the main variables of interest in your project, and justify your choice of variables.
  • Provide descriptive statistics for your text variables. These can be a mix of graphs and summary tables.
  • You don’t have to summarize everything in every dataset, just the variables that are relevant to your analysis.
  4. Describe your results:
  • Find a story in the data. Explore that story and make sure it’s true and insightful.
  • Display charts that help answer your research question, whether the evidence supports or contradicts what you expected, or that illustrate what else you would need in order to address it.
  • Write narrative text around your charts to explain what the charts show and their significance towards addressing your research question. This should read as a continuous story rather than as a reply to each of the requirements described here.
  • Your plot type choices should highlight the main point(s) you want to make or clearly show the relationship you want to emphasize. Basically, they should answer the question “what do the text data say about my research question?”
  • Your charts should be polished, following the design principles we have covered in class.
  • Export the charts you created into the folder named “images”.
  • You must include at least three different polished tables/charts (i.e. don’t just make three barplots of word frequencies).
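
The data-description and chart-export tasks above can be sketched roughly as follows. This is a minimal illustration, not the required approach. The tiny `reviews` data frame, its `group` and `text` columns, and the file name `top_words.png` are all hypothetical stand-ins. The sketch uses base R only so that it runs without extra packages. In your own project you would load your chosen dataset with `read_csv()`, and the tools from the lecture scripts would do the same job.

```r
# Hypothetical stand-in for a real dataset; in practice you would load
# your own file, e.g. reviews <- readr::read_csv("your_data.csv").
reviews <- data.frame(
  group = c("positive", "positive", "negative"),
  text  = c("Great phone, great battery.",
            "Battery life is great!",
            "Terrible battery and a terrible screen."),
  stringsAsFactors = FALSE
)

# Tokenise: lower-case the text, then split on anything that is not a letter.
tokenise <- function(x) {
  words <- unlist(strsplit(tolower(x), "[^a-z]+"))
  words[nchar(words) > 0]
}

# Descriptive statistic 1: number of words in each document.
reviews$n_words <- vapply(reviews$text,
                          function(x) length(tokenise(x)),
                          integer(1), USE.NAMES = FALSE)

# Descriptive statistic 2: mean document length within each group.
mean_words_by_group <- tapply(reviews$n_words, reviews$group, mean)

# Descriptive statistic 3: overall word frequencies, most frequent first.
word_counts <- sort(table(tokenise(reviews$text)), decreasing = TRUE)

# Export one chart into the "images" folder.
dir.create("images", showWarnings = FALSE)
png(file.path("images", "top_words.png"), width = 800, height = 500)
barplot(head(word_counts, 5), main = "Most frequent words", ylab = "Count")
invisible(dev.off())
```

A real report would go further, for example by removing stop words before counting and by polishing the chart according to the design principles covered in class.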

Sources:

This assignment is inspired by and/or adapted from other sources, including: