This document describes guidelines for your:

  • Final Project Plan and Data Source Proposal (5% of Your Final Grade, due in Moodle by Friday, 6 April at 6:30 pm Eastern).
  • Final Project Report and Presentation (20% of Your Final Grade). Presentations are Wed May 10 from 6:30 pm to 9:30 pm; the report is due by 11:59 pm Eastern on Friday, May 12 via Moodle.

The final project for this class will consist of an analysis of a data set of interest to you. The goal of this project is for you to demonstrate proficiency in the techniques we have covered in this class (and beyond, if you like!) and apply them to a data set in a meaningful way.

Note: You can work individually or in groups of up to 3 students for this final project.


It is important that you choose a manageable data set. This means that the data should be readily accessible and large enough that multiple relationships can be explored. As such, your data set must have at least 50 observations and more than 5 variables/attributes. (Exceptions can be made, but you must notify me beforehand.) Ideally, the data set’s variables should include both categorical and numerical variables.
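
A quick way to confirm that a candidate data set meets these size requirements is sketched below. It uses the `mpg` data set that ships with ggplot2 purely as a stand-in (an assumption for illustration); replace `candidate` with your own data frame.

```r
library(tidyverse)

# `mpg` (bundled with ggplot2) is only a stand-in; use your own data frame here.
candidate <- mpg

n_obs  <- nrow(candidate)  # number of observations (rows)
n_vars <- ncol(candidate)  # number of variables (columns)

# Count how many variables are categorical (character/factor) and how many are numerical.
n_categorical <- sum(map_lgl(candidate, ~ is.character(.x) || is.factor(.x)))
n_numerical   <- sum(map_lgl(candidate, is.numeric))

# TRUE only if the data set has at least 50 observations and more than 5 variables.
meets_size_requirement <- n_obs >= 50 && n_vars > 5
meets_size_requirement
```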

All analyses must be done in RStudio. An exception will be made if you want to use models, data sets and libraries in HuggingFace hub; in that case you would be better off using Python but I am not planning to cover that till the end of the semester, if we have time. Also, if you are using a data set that comes in a format that we have not encountered in class yet, make sure that you are able to load it into RStudio (for example reading data from JSON files, performing web-scraping, or using an API to gather observations). If you are having trouble, ask for help before it is too late.
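
As one illustration of loading an unfamiliar format, JSON data can often be read into a data frame with the jsonlite package. The sketch below parses a small made-up JSON string so that it runs without a network connection. The field names are hypothetical; for a real project you would pass a file path or URL to `fromJSON()` instead.

```r
library(jsonlite)
library(tibble)

# A tiny, made-up JSON array of records (hypothetical fields, for illustration only).
json_text <- '[
  {"title": "Movie A", "year": 2001, "rating": 7.5},
  {"title": "Movie B", "year": 2010, "rating": 8.1}
]'

# fromJSON() simplifies an array of records into a data frame by default.
movies <- as_tibble(fromJSON(json_text))
movies
```

Real JSON files are frequently nested, in which case some columns may come back as lists or data frames and need further unpacking before analysis.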

Some available data sources for this final project are listed at the end of this document. Please contact me as soon as possible to discuss the possibility of using an alternative data source.

Final Project Plan and Data Source Proposal

The Final Project Plan and Data Source Proposal is a draft of your introduction and the motivation behind your selected data set and topic. Write about your topic in an R Markdown file called ideas.Rmd.

This is due by April 8 at 6:30 pm. See the Moodle turn-in link for details.

Generating ideas is the first step to starting any project! This is the place for you to generate topic ideas and identify data sets that could be used to explore these topics.

Identify topics of interest for the project. Topics should be detailed enough to give you some guidance as you start analyzing your data. For example, a topic idea may be “characteristics of popular movies.”

Please note that if you are doing your project in a group then each group member should still turn in a submission (ideas.Rmd and your data file) for the proposal. Ideally the proposal submission from each group member would be basically the same. Please state who you are working with if you are doing this in a group.

Deliverables for proposal

  • You will upload your work using the provided zipped folder that appears on the assignment “Final Project Plan and Data Source Proposal.” See the Moodle turn-in link.

  • Put ideas.Rmd and your raw data set in the provided zipped folder which you will upload via Moodle in the corresponding assignment page for the Final Project Plan and Data Source Proposal.

  • ideas.Rmd should contain the following sections:

Introduction

The introduction should introduce your general research topic and your raw data (where it came from, how it was collected, what are the cases, what are the variables, etc.).

Data Description

Print out a “glimpse” of the data frame. Create a data dictionary that is neatly formatted and easy to read.
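
One possible way to produce both pieces in ideas.Rmd is sketched below. The `mpg` data set from ggplot2 is used only as a stand-in for your own data, and building the dictionary by hand as a tibble rendered with `knitr::kable()` is just one reasonable approach, not a required one.

```r
library(tidyverse)

# Print a compact overview of the data frame: one line per variable.
glimpse(mpg)

# A hand-written data dictionary: one row per variable.
# Descriptions are paraphrased from the ggplot2 documentation for `mpg`.
data_dictionary <- tibble(
  variable    = names(mpg),
  type        = unname(map_chr(mpg, ~ class(.x)[1])),
  description = c(
    "Manufacturer name",
    "Model name",
    "Engine displacement, in litres",
    "Year of manufacture",
    "Number of cylinders",
    "Type of transmission",
    "Type of drive train (f = front-wheel, r = rear-wheel, 4 = four-wheel)",
    "City miles per gallon",
    "Highway miles per gallon",
    "Fuel type",
    "Type of car"
  )
)

# Render the dictionary as a neatly formatted table when the document is knit.
knitr::kable(data_dictionary)
```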

Preliminary Exploratory Data Analysis

The exploratory data analysis should include the following:

  • Uni-variate summary statistics and data visualizations.
  • Bi-variate and/or multivariate data visualizations and summary statistics if applicable.
  • Narrative about what you observe in the exploratory data analysis and what you learn about the data from the exploratory data analysis.
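
The items above could be sketched along the following lines. Again, `mpg` is only a stand-in for your own data, and the specific variables and plot types are illustrative choices rather than requirements.

```r
library(tidyverse)

# Univariate: summary statistics for one numerical variable.
hwy_summary <- mpg %>%
  summarise(
    n          = n(),
    mean_hwy   = mean(hwy),
    median_hwy = median(hwy),
    sd_hwy     = sd(hwy)
  )

# Univariate: histogram of the same numerical variable.
hwy_hist <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2) +
  labs(x = "Highway miles per gallon", y = "Number of cars")

# Univariate: frequency table for one categorical variable.
class_counts <- mpg %>% count(class, sort = TRUE)

# Bivariate: scatterplot of two numerical variables, colored by a categorical one.
displ_hwy_plot <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  labs(x = "Engine displacement (litres)", y = "Highway miles per gallon")

# Bivariate: a numerical summary within each level of a categorical variable.
hwy_by_drv <- mpg %>%
  group_by(drv) %>%
  summarise(n = n(), median_hwy = median(hwy))

hwy_summary
class_counts
hwy_by_drv
hwy_hist
displ_hwy_plot
```

Each table or plot in the report should be followed by a sentence or two of narrative describing what it shows.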

Research Questions

Using what you learn from the exploratory data analysis as a guide, formulate research questions that you can explore with the data chosen for your project.

The Final Project Report and Presentation

The majority of the write up will be the revised results from the above proposal and data analysis. After providing the description of your data set and research questions in the introduction, write up the results of the data analysis, incorporating any feedback you received from me during the first stages of the project. Remember to pay attention to your presentation. Neatness, coherency, and clarity will count.

This means that your final report will include the completed versions of the sections above. Please rename the section ‘Preliminary Exploratory Data Analysis’ to ‘Exploratory Data Analysis’.

Your write up must also include a one to two page conclusion and discussion. This will require a summary of what you have learned about your research question. Provide suggestions for improving your analysis, and include a paragraph on what you would do differently if you were able to start over with the project or what you would do next if you were going to continue work on the project.

The project is very open-ended. You should create some compelling visualization(s) of this data in R. There is no limit on what tools or packages you may use, but sticking to packages we learned in class (tidyverse) is sufficient. Remember that you do not need to visualize all of the data at once.

The final report zipped folder

If you are doing this in a group, then each of you will need to turn in the same zipped folder.

At a minimum, the final zipped folder should have the file structure shown below (see this example)

├── data/           # data folder
│   ├── README.md   # data description and link to source
├── proposal/       # proposal folder
│   ├── ideas.Rmd   # RMarkdown file with proposal for project (yes this is what you will have already turned in)
├── project/        # place project files here
│   ├── report.Rmd  # RMarkdown file used to perform the analysis
├── presentation/   # place the .pdf or .Rmd with supporting images (so that it can be knit) or .pptx presentation here
├── project.Rproj   # .Rproj file for the project (optional but recommended)
└── README.md       # README file with contact information and brief description
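
If it helps, the skeleton above can be created from the R console. The sketch below builds it inside a temporary directory under the hypothetical folder name `final-project`; point `root` at your own project location instead.

```r
# Hypothetical project root inside a temporary directory; replace with your own path.
root <- file.path(tempdir(), "final-project")

# Create the four sub-folders from the structure above.
dirs <- file.path(root, c("data", "proposal", "project", "presentation"))
for (d in dirs) dir.create(d, recursive = TRUE, showWarnings = FALSE)

# Create empty placeholder files to be filled in later.
files <- file.path(
  root,
  c("data/README.md", "proposal/ideas.Rmd", "project/report.Rmd", "README.md")
)
invisible(file.create(files))

# List everything that was created, relative to the project root.
list.files(root, recursive = TRUE)
```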

  • If you find yourself in a situation that is difficult to resolve, ask me as soon as possible.

Presentation

Final project presentations will be held Wed May 10 from 6:30 pm to 9:30 pm.

Each project's presentation needs to be between 8 and 10 minutes total. Please note that if you are doing this in a group, the entire presentation, with each group member presenting, must still be between 8 and 10 minutes total.

  • An important aspect of doing research is taking time to share your findings with others. I will give everyone time to share their project and summarize their findings.

  • The presentation (15% of final project grade) should consist of about 5–10 slides. You should briefly describe the research question, the data involved, the analysis performed, and the conclusions drawn from the analysis.


Deliverables

As mentioned above, you will upload to Moodle your .zip file that contains the files and structure above. Your goal is to submit a cohesive project report that conveys that you have mastered some of the data science techniques that we have discussed in class. The Moodle turn-in link will be posted after your proposals are submitted.

  • The final project supporting documents are due by 11:59 PM Eastern on May 12. (We are not having a final exam; we are just using the final exam time to finish up presentations and wrap up.)

  • Upload your submission via Moodle in the corresponding assignment page for the final project. Each student will submit the project.

  • This means that if your project was done in a group, the final project submission, due by May 12, needs to be the same across the group.

Grading

  • Team peer evaluation: you will be asked to fill out a survey where you rate the contribution and teamwork of each team member.

Grading of the project will take into account the following:

  • Content: What is the quality of research and/or question and relevancy of data to those questions?

  • Correctness: Are data science procedures carried out and explained correctly?

  • Writing and Presentation: What is the quality of the presentation, writing, and explanations?

  • Creativity and Critical Thought: Is the project carefully thought out? Are the limitations carefully considered? Does it appear that time and effort went into the planning and implementation of the project?

A general breakdown of scoring is as follows:

  • 90%-100%: Outstanding effort. Student understands how to apply data science concepts, can put the results into a clear and convincing argument, can identify weaknesses in the argument, and can communicate the results to others.

  • 80%-89%: Good effort. Student understands most of the concepts, puts together an adequate argument, identifies some weaknesses of their argument, and communicates most results clearly to others.

  • 70%-79%: Passing effort. Student has misunderstanding of concepts in several areas, has some trouble putting results together in a coherent argument, and communication of results is sometimes unclear.

  • 60%-69%: Struggling effort. Student is making some effort, but has misunderstanding of many concepts. Communication of results is unclear.

  • Below 60%: Student is not making a sufficient effort.

Project Data

  • Familiarize yourself with your data. Check if there is any missing information, understand the variables involved in the data set, and understand the nature of the data.

  • Remember to perform any relevant exploratory data analysis (EDA) and clearly explain your findings.

  • Include data visualizations that help you understand something in particular about your data.

  • Create summaries, comparisons, and any other type of quantitative analysis.
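
As a small illustration of the first point, missing values can be counted per variable as sketched below. The built-in `airquality` data set is used only because it contains missing values; substitute your own data frame.

```r
library(tidyverse)

# `airquality` (built into R) is only a stand-in; replace it with your own data.
# Count the number of missing values (NA) in every column.
missing_counts <- airquality %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

missing_counts

# Base R overview of each variable, including its range, quartiles, and NA count.
summary(airquality)
```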

For this project, you may use any of the data sources below. (Contact me as soon as possible to discuss the possibility of considering an alternative data source.)

R4DS Tidy Tuesday

Data from Tidy Tuesday’s challenges. Check a list here. For example, to use the data from the “Space Launches” repo, you could use

library(tidyverse)
launches <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-01-15/launches.csv")

pudding.cool data stories

Data from pudding.cool’s stories. Check a list of available datasets here. For example, to read the data associated with the article “Greetings from Mars”, you could use

library(tidyverse)
mars <- read_csv("https://github.com/the-pudding/data/raw/master/mars-weather/mars-weather.csv")

Data Is Plural

You can search for a dataset of interest at the “Data Is Plural” archive website. This site is maintained by Jeremy Singer-Vine (a journalist, computer programmer, and data editor based in New York City).

Additional Data sets

Below is a list of additional data sources you may consider for your final project.

  • U.S. Government’s open data. Find a data set of interest at https://catalog.data.gov/dataset. There are thousands of data sets the Government makes available on topics including education, agriculture, law, safety, research, and more.

  • BuzzFeedNews’ index of all their open-source data, analysis, libraries, tools, and guides. Data, scripts, and related stories can be found at https://github.com/BuzzFeedNews/everything

  • ProPublica’s data store. Topics include transportation, criminal justice, environment, finance, among others. Look for free data sets available at https://www.propublica.org/datastore/datasets/

  • The Bureau of Transportation Statistics (BTS), part of the Department of Transportation (DOT), is the preeminent source of statistics on commercial aviation, multimodal freight activity, and transportation economics, and provides context to decision makers and the public for understanding statistics on transportation. Find information at https://www.bts.gov/browse-statistical-products-and-data

  • The U.S. Bureau of Economic Analysis, a source of accurate and objective data about the nation’s economy. Find information at https://www.bea.gov/data

  • National Center for Education Statistics (NCES) data products. NCES is the primary federal entity for collecting and analyzing data related to education in the U.S. and other nations. NCES is located within the U.S. Department of Education and the Institute of Education Sciences. Find more information at https://nces.ed.gov/datatools/

  • HealthData.gov. This site is dedicated to making data discoverable and making valuable government data available to the public in the hopes of better health outcomes for all. Explore datasets at https://healthdata.gov/