Data Mining (95-791 Z4)

Syllabus

Mini 4, Spring 2017

This syllabus is adapted from Dr. Dubrawski’s 95-791 Data Mining Syllabus.

Course Instructor/Facilitator:
Karen (Lujie) Chen karenchen@cmu.edu
Teaching Assistants:
Yixin Qi yixinq@andrew.cmu.edu
Disha Gupta dishag@andrew.cmu.edu
Video Lecture Instructor:
Dr. Artur Dubrawski awd@cs.cmu.edu

Prerequisites

95-796 “Statistics for IT Managers” or instructor’s permission based on the student’s knowledge of fundamentals of probability and statistics. Previous experience with data analysis will be considered a plus, although it is not absolutely necessary.

Introduction

Data mining – intelligent analysis of information stored in data sets – has gained substantial interest among practitioners in a variety of fields and industries. Nowadays, almost every organization collects data, which can be analyzed to support making better decisions, improving policies, discovering computer network intrusion patterns, designing new drugs, detecting credit fraud, making accurate medical diagnoses, predicting imminent occurrences of important events, monitoring and evaluating reliability to preempt failures of complex systems, and more.

About the Video Lecture Instructor

Artur Dubrawski is a scientist and a practitioner. He has been researching machine intelligence and its applications for twenty-five years. In the past, he was affiliated with an advanced data mining firm, Schenley Park Research, and served as Chief Technology Officer at Aethon, a local high-tech company making autonomous delivery robots. Currently Dr. Dubrawski is a faculty member at the CMU Robotics Institute, where he directs the Auton Lab, a data mining and machine learning research group. The Auton Lab’s work has yielded multiple deployments of analytic solutions and software in various government and industrial applications.

About the Course Facilitator

Karen Chen is a PhD student in the information systems program of Heinz College and is also associated with the Auton Lab under the supervision of Dr. Dubrawski. Her research interests are in big data analytics, machine learning, and data mining applications, in particular the modeling of temporal dynamics of real-time sensor data with applications to health care and education. Some of her work has involved analyzing physiological signals from continuously monitored patients as well as psychological signals of emotional states from facial expression analysis. Before her PhD studies, she worked as a research staff member with the Auton Lab for about 10 years, working on a variety of data mining and analytics projects in the areas of public health, food safety, health insurance, and fuel efficiency. She holds an MISM degree and an M.S. in statistics, both from Carnegie Mellon University, and a B.Eng. degree in business and computer science from Shanghai Jiaotong University in China.

Course Objectives

This course will provide participants with an understanding of fundamental data mining methodologies and with the ability to formulate and solve problems with them. Particular attention will be paid to practical, efficient and statistically sound techniques, capable of providing not only the requested discoveries, but also estimates of their utility. The lectures will be complemented with hands-on experience with data mining software, primarily R, to allow development of basic execution skills.

The course will cover the following groups of topics.

  1. Foundations. How to make data mining practical? (approximately 40% of class time)
  • Learning from data: why, what and how?
  • Fundamental tasks, issues and paradigms of learning models from data.
  • Real world data is noisy and uncertain. How much can we trust the results of our analyses?
  • Model selection
  • Reduction of dimensionality and data engineering
  • Measures of association between data attributes: information theoretic, correlational
  2. Pragmatic methodologies for mining data (approximately 60% of class time)
  • Predictive analytics: classification and regression
  • Cost-sensitive model selection using ROC approach
  • Compression of data and models for improved reliability, understandability, and tractability of large sets of highly dimensional data
  • Association rule learning and decision list learning, decision trees
  • Introduction to density estimation, anomaly detection, and clustering
  • Overview of mining complex types of data
  • Illustrative examples of real-world applications

Reading Material

Unfortunately, the ideal textbook for this course does not exist. Instead, we will use a selection of readings excerpted from a variety of sources. These readings are intended to complement the material presented in class. Selected issues covered by the required readings will become topics of graded assignments and final examination.

All required material will be distributed electronically through the course site, or as pointers to resources available on the internet for free download. Note that many of the readings are protected under copyright law. In order to use them in this course it was necessary to purchase official permissions from the copyright holders. Each enrolled student may have their HUB account charged with an equal share of the copyright fees. Although the exact amount of the individual share is not known at the time of writing, it is estimated not to exceed $30.00. Please note that it is illegal to distribute copies of the copyrighted materials without obtaining permission from their legal owners.

Interested students are welcome to go beyond the scope of the required readings. In particular, the following books, listed in no particular order, are recommended but not required:
1. Hand, Mannila and Smyth: Principles of Data Mining, MIT Press, 2001.
2. Witten and Frank: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2000 (newer editions are available).
3. Hastie, Tibshirani and Friedman: The Elements of Statistical Learning, Springer, 2001.
4. Mitchell: Machine Learning, McGraw-Hill, 1997.

Software and Hands-on Exercises

We will primarily rely on R, a free software environment, to demonstrate and operationalize concepts presented during lectures. Students are expected to download and install the software, as well as learn basic usage skills on their own using tutorials available online. Appropriate resources will be recommended during the first lecture and/or recitation session.

Recitations will review concepts taught in lectures and connect them to homework problems through examples. Recitation sessions in which software tools are introduced will provide an opportunity for hands-on experience: students will be asked to follow the presenter on their own laptops and to work on assigned exercises during the session.

Assignments and Deadlines

All homework assignments will be distributed electronically through Piazza. All reports (including homework) must be submitted electronically through Blackboard. Late homework will be accepted until 24 hours past the hard deadline, but it will be subject to an automatic 50% grade reduction.

Starting from week 2, each team is required to submit a project update (see here for details). The deadlines are the same as the homework deadlines.

submission deadlines

Grading

Grades will be based upon the results of four homework assignments and one analytical project.

The analytical project will be conducted in small groups of students. Each team will analyze specific real-world data. The project will be graded based on a report presentation of the results. Details of the project can be found here.

The final grade for this course is composed of the following:
1. Homework (individual work, 4 times 12.5%): 50%
2. Analytical project (in teams): 50%
– weekly progress (10%)
– final deliverables (40%)
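
As an illustration of how these weights combine, the following minimal Python sketch computes a final grade from component scores. The function name, the input layout, and the assumption that every component is scored on a 0 to 100 scale are hypothetical and are not part of the course policy; the late factor mirrors the 50% reduction for late homework described under Assignments and Deadlines.

```python
# Illustrative sketch only: the weights come from the grading breakdown above;
# the 0-100 scale and all names below are assumptions, not official policy.

HOMEWORK_WEIGHT = 0.125      # each of the four homework assignments
PROGRESS_WEIGHT = 0.10       # weekly project progress
DELIVERABLES_WEIGHT = 0.40   # final project deliverables
LATE_FACTOR = 0.5            # 50% reduction for homework submitted late


def final_grade(homework_scores, late_flags, weekly_progress, final_deliverables):
    """Combine component scores (each assumed to be 0-100) into a final grade."""
    if len(homework_scores) != 4 or len(late_flags) != 4:
        raise ValueError("expected exactly four homework scores and late flags")
    # Apply the late reduction per assignment before weighting.
    homework_total = sum(
        score * (LATE_FACTOR if late else 1.0) * HOMEWORK_WEIGHT
        for score, late in zip(homework_scores, late_flags)
    )
    return (homework_total
            + weekly_progress * PROGRESS_WEIGHT
            + final_deliverables * DELIVERABLES_WEIGHT)


if __name__ == "__main__":
    # Hypothetical student: fourth homework submitted late.
    grade = final_grade([90, 80, 100, 70], [False, False, False, True], 95, 85)
    print(f"Final grade: {grade:.3f}")
```

The sketch treats a late submission as one received within the 24-hour grace period; homework submitted later than that is not accepted and would simply score zero.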

Academic Integrity

Students are expected to strictly follow Carnegie Mellon University rules of academic integrity in this course. This means homework is to be the work of the individual student, using only permitted material and without any cooperation from other students or third parties. It also means that use of work by others is only permitted in the form of quotations, and any such quotation must be distinctively marked to enable identification of the student’s own work and own ideas. All external sources used must be properly cited, including author name(s), publication title, year of publication, and a complete reference needed for retrieval. For the group project, the work should be the work of the group members only. In all their work, students should not in any way rely on solutions to problems distributed in prior years or on the work of prior students or other current students. Violations will be penalized to the full extent mandated by CMU policies. There will be no exceptions.