Data Mining (95-791 Z4)

Project Overview

The analytical project in this course is a great opportunity for you to apply your knowledge of data mining and your R/Python programming skills to a real-life analytical problem. Unlike the data sets you encounter in homework, the data sets we use in this project are often messy, dirty and uncertain, and the problems are not necessarily well defined. This reflects the reality of the problems we try to solve in the real world.

Project Selection

Option A

Pick one of the 8 projects available here (link). Each of these projects comes with a specific dataset and tasks.

Option B

You may also choose to work with your own dataset in a domain you’re interested in. In that case, please send me a paragraph during the first week describing the dataset and the problems you plan to explore. We may need to redefine or rescope the problems so that the project is manageable within the scope of the course. If, however, we decide that your proposed project is not suitable for this course (e.g. because the dataset is unavailable or of low quality, or the tools required are outside the scope of the course), you may fall back to Option A.

Here is the link to a few domains and public datasets for your reference. However, feel free to explore the domain you find most interesting.

Teaming

You will work in a team of 1-2 members. We will adjust our expectations for the amount of work according to the team size. You may form a team on your own, or you may use the “Search for Teammate” function in Piazza to find a teammate.

Deadline for Selection and Teaming Decisions

The project selection and teaming decisions need to be finalized by the end of the 1st week (March 25); however, you’re strongly encouraged to decide early so that you can start early. Please fill in the sign-up form here. In this form, you will be asked to provide the following information about your project.

  • Project Name
  • Project Description (short phrase)
  • Team Name
  • Team Members (Name(s) and Andrew ID(s))
  • Google Doc links for ongoing communication between team members and us
  • Team members’ relevant experience, if any (e.g. domain expertise, analytical or programming experience)
  • Who is the client (e.g. management of a company, policy maker etc.)
  • Value Proposition (what kind of value you believe the output of this project might add for your client; it might be just speculation at this point)

Weekly progress update

Starting from the second week, your team will be asked to submit a progress update in a Google Doc shared with the instruction team. In this update, you will write a paragraph summarizing the following:

  • What has been done?
  • What are the obstacles?
  • What is your plan for next week?

The instructors and TA will use this doc as an ongoing communication tool with your team during the project phase, giving you advice and guidance.

Mentoring Interaction

Just like in a real-life consulting project, you’re strongly encouraged to interact with the “client” to seek clarification or confirm expectations during the course of your project. At the same time, we are also the mentors for your project and will help you deliver a successful one. Depending on your preference, you may interact through forum Q&A (please use your specific project tag) for ad hoc questions, or schedule a video conference (such as Google Hangout/Skype) at a mutually convenient time, in which case please give us advance notice.

Project deliverable

Report

It is highly recommended that you present your results using RMarkdown or Jupyter Notebook.

These are great tools for seamlessly integrating your analysis in R/Python with the report production process. If you use RMarkdown, you may submit your report in HTML (in which case you may publish your report to Rpubs.com). Please refer to the RMarkdown cheat sheet. If you use Python, you may publish using gist.

This report can become part of your “data science” portfolio in the future.

Your report should include the following sections:
* Abstract
* Introduction
* Method and results
* Conclusion and future work
* Your takeaway from this project (a reflection on project execution and teamwork: what worked, what didn’t, what you would have done differently to make it a better project, etc.)

Caselet write up

We are building a repository of caselets (caselet = small case study) to help beginner data scientists build data science problem-solving skills using authentic problems and data sources. The caselets will be deployed in an online learning environment where timely feedback and explanations are provided as users work through the problems. This is part of the CMU Simon Initiative funded project on “Accelerated Apprenticeship”, whose goal is to teach data science problem-solving skills at scale.

A caselet includes the following components:

  • Problem context
  • Data description (in the form of a tabular summary or plots; a template will be provided)
  • A list of 5-7 multiple-choice questions, with correct answers and explanations provided

A sample caselet write-up can be found here. Link
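For the data description component, a tabular summary can be produced with a few lines of Python. The sketch below is only an illustration, not the course template: the function names `summarize_columns` and `format_table`, the set of missing-value markers, and the sample rows are all assumptions made up for this example.

```python
import statistics

# Assumed markers for missing values; adjust to match your own dataset.
MISSING = {"", "NA", "N/A", "null", "None"}


def summarize_columns(rows):
    """Return one summary dict per column for a list of row dicts."""
    # Collect column names in first-seen order.
    columns = []
    for row in rows:
        for name in row:
            if name not in columns:
                columns.append(name)

    summary = []
    for name in columns:
        raw = [row.get(name) for row in rows]
        # Keep only values that are not treated as missing.
        present = [
            v for v in raw
            if v is not None and str(v).strip() not in MISSING
        ]
        # Try to read every present value as a number.
        numeric = []
        for v in present:
            try:
                numeric.append(float(v))
            except (TypeError, ValueError):
                pass
        is_numeric = bool(present) and len(numeric) == len(present)
        summary.append({
            "column": name,
            "n": len(raw),
            "missing": len(raw) - len(present),
            "distinct": len({str(v).strip() for v in present}),
            "mean": round(statistics.mean(numeric), 2) if is_numeric else None,
        })
    return summary


def format_table(summary):
    """Render the summary as a fixed-width text table."""
    header = ["column", "n", "missing", "distinct", "mean"]
    lines = ["  ".join(f"{h:>10}" for h in header)]
    for item in summary:
        lines.append("  ".join(f"{str(item[h]):>10}" for h in header))
    return "\n".join(lines)


# Hypothetical messy sample: blank cells, "NA" markers, inconsistent casing.
rows = [
    {"age": "34", "city": "Pittsburgh", "income": "52000"},
    {"age": "", "city": "pittsburgh ", "income": "NA"},
    {"age": "41", "city": "Boston", "income": "61000"},
]
print(format_table(summarize_columns(rows)))
```

A summary like this also tends to surface the issues a caselet question can be built on, such as missing values or the same city spelled two ways.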

A good caselet will reflect the confusion, pain points or common mistakes a beginner data scientist will typically encounter. Good questions may arise from your own journey of discovery, or from your interactions with mentors.

High quality caselets will be invited to be included into the caselet repository which, after being put into production, will be used by a larger community of data science learners within or outside Carnegie Mellon.

Your caselet will be submitted as a shared google doc.

Deadline

The deadline for submitting both the final report and the caselet write-up is midnight of May 13th.

Grading Criteria

Project Report

  • Understanding of the problem
  • Appropriateness of the method
  • Interpretation of results
  • Clarity of report writing
  • Creativity/insight

Caselet Writeup

  • Clarity of presentation
  • Correctness of answers
  • Appropriateness of explanations

All team members receive the same grade unless there is a request to grade independently because of a severely unbalanced workload.

Advice

  1. Decide Early! Start Early!
  2. As the course progresses, think about which methods are applicable and incrementally build up your analysis (you’re welcome to share your work-in-progress with us and receive feedback)
  3. Enjoy the immersive experience!