For your final project, you will take a dataset, explore it, tinker with it, and tell a nuanced story about it using any method of automated text analysis covered in this class. I want this project to be as useful for you and your future career as possible - you’ll hopefully want to show off your final project in a portfolio or during job interviews.
Accordingly, you have some choice in what data you can use for this project. I’ve found several different high-quality text datasets online.
You do not have to choose a dataset in your given emphasis. That is, you can find and use your own text data. Choose whichever dataset you are most interested in or will have the most fun with.
Literary works
gutenbergr package
Review
IMDB Dataset of 50K Movie Reviews: IMDB dataset having 50K movie reviews for natural language processing or Text analytics
515K Hotel Reviews Data in Europe: 515,000 customer reviews and scoring of 1,493 luxury hotels across Europe
Spam
90 Twitter datasets available - data.world: Real-world Twitter data that can be used for text analysis.
Write a memo using R Markdown to introduce, frame, and describe your story and figure. Use the final project template to get started. You should include the following in the memo:
The skills covered in this course are rooted in analytical skills for text data rather than formulas and equations. As such, the application of these principles to a real data problem is one of the best ways to learn and assess mastery of these skills. I guarantee you one day you will need to apply these principles to communicate an idea or a story to audiences, so let’s make sure you have at least one chance to practice before the stakes are higher.
Your final project should illustrate your ability to transform raw text data into insights by making the unstructured structured, showing clear trends or patterns, and/or identifying information from text data. The specific skills involved in achieving this goal include all of the course learning objectives listed on our E-class page.
You should work on this final project individually. You may team up with classmates to help each other, but each student must finalize and submit their own individual report.
Use the final report template for your analysis and report. Write your final report as a .Rmd file that compiles to an HTML webpage. Publish the compiled page online (e.g., via RPubs or GitHub), then submit all of your files (including your .Rmd file, data files, and image files) as a single .zip file on E-class by the due date. Also include the URL of the published HTML report in your E-class submission.
We will use this rubric to grade your report.
Your final report should be a fully reproducible product, available online as an HTML webpage. It should include text, data, code, and plots. Below is a list of specific items your report should include (check the rubric to see their relative weighting).
State your research question and motivate why it is important / why the reader should care.
Describe your data:
This assignment is inspired by and/or modified from other sources, including: