Course Instructor/Facilitator:
Karen (Lujie) Chen karenchen@cmu.edu
Teaching Assistant:
Pranav Bhatt pbhatt@andrew.cmu.edu
Video Lecture Instructor:
Dr. Artur Dubrawski awd@cs.cmu.edu
95-796 “Statistics for IT Managers” or instructor’s permission based on the student’s knowledge of fundamentals of probability and statistics. Previous experience with data analysis will be considered a plus, although it is not absolutely necessary.
Data mining, the intelligent analysis of information stored in data sets, has gained substantial interest among practitioners in a variety of fields and industries. Nowadays, almost every organization collects data, which can be analyzed in order to support making better decisions, improving policies, discovering computer network intrusion patterns, designing new drugs, detecting credit fraud, making accurate medical diagnoses, predicting imminent occurrences of important events, monitoring and evaluating reliability to preempt failures of complex systems, etc.
Artur Dubrawski is a scientist and a practitioner. He has been researching machine intelligence and its applications for twenty-five years. In the past, he was affiliated with an advanced data mining firm, Schenley Park Research, and served as Chief Technology Officer at Aethon, a local high-tech company making autonomous delivery robots. Currently Dr. Dubrawski is a faculty member at the CMU Robotics Institute, where he directs the Auton Lab, a data mining and machine learning research group. Auton Lab’s work has yielded multiple deployments of analytic solutions and software in various government and industrial applications.
Karen Chen is a PhD student in the information systems program of Heinz College. She is also associated with the Auton Lab under the supervision of Dr. Dubrawski. Her research interests are in big data analytics, machine learning, and data mining applications, in particular the modeling of temporal dynamics of real-time sensor data with applications to health care and education. Some of her work has involved analyzing physiological signals from continuously monitored patients as well as psychological signals of emotional states from facial expression analysis. She is also interested in teaching problem-solving skills in data science. Before her PhD career, she worked as a research staff member with the Auton Lab for about 10 years, working on a variety of data mining and analytics projects in the areas of public health, food safety, health insurance, and fuel efficiency. She holds master's degrees in information systems (MISM) and statistics, both from Carnegie Mellon University, and a B. Eng degree in business and computer science from Shanghai Jiaotong University in China.
This course will provide participants with an understanding of fundamental data mining methodologies and with the ability to formulate and solve problems with them. Particular attention will be paid to practical, efficient and statistically sound techniques, capable of providing not only the requested discoveries, but also estimates of their utility. The lectures will be complemented with hands-on experience with data mining software, primarily R and Python, to allow development of basic execution skills.
The scope of the course will cover the following groups of topics.
Unfortunately, the ideal textbook for this course does not exist. Instead, we will use a selection of readings excerpted from a variety of sources. These readings are intended to complement the material presented in class. Selected issues covered by the required readings will become topics of graded assignments and final examination.
All required material will be distributed electronically through the course site or as pointers to resources available on the internet for free download. Note that many of the readings are protected under copyright law. In order to use them in this course it was necessary to purchase official permissions from the copyright holders. Each enrolled student could have their HUB account charged with an equal share of the copyright fees. Although the exact amount of the individual share is not known at the moment of writing this document, it is estimated not to exceed $30.00. Please note that it is illegal to distribute copies of the copyrighted materials without obtaining permissions from their legal owners.
Interested students are welcome to go beyond the scope of the required readings. In particular, the following books are recommended - but not required - listed in no particular order:
1. Hand, Mannila and Smyth: Principles of Data Mining, MIT Press, 2001.
2. Witten and Frank: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2000 (with newer editions available).
3. Hastie, Tibshirani, Friedman: The Elements of Statistical Learning, Springer, 2001.
4. Mitchell: Machine Learning, McGraw-Hill, 1997.
Along with the course, we will also provide additional optional readings/resources for the purpose of expanding the knowledge bases of the students.
We will primarily rely on R and Python to demonstrate and operationalize concepts presented during lectures. Students are expected to download and install the software, as well as learn basic usage skills on their own using tutorials available online.
All homework assignments will be distributed electronically through Canvas. Homework must be submitted electronically through Canvas. In principle, late homework will be accepted until 24 hours past the deadline, but it will be subject to an automatic 50% grade reduction. However, I do accommodate late submissions for unexpected events such as job deadlines, family events, or health issues; in such cases, please communicate with me in a timely manner.
Starting from week 2, at the end of each week we will ask you to write one multiple-choice question that you think will benefit future students while they are viewing the lectures. Examples of how to write up questions will be provided. You will provide the question, the answer choices, your proposed answer, and an explanation. Your question and answer can be based on your own Q&A with the instruction team. We will provide feedback on your question. Good questions will be selected for the Quiz Bank to be used in future editions of the class.
Starting from week 2, each team is required to submit a weekly update on their project; see Link to projects for details.
Each team will deliver a project report presentable as a “portfolio” toward the end of the course. In addition, each team will submit a caselet write-up. See Link to projects for details.
Please refer to this spreadsheet for one-stop information on lectures, homework, and project deadlines: Link to Weekly Schedule
Grades will be based upon the results of four homework assignments, the analytical project, and contributions to the quiz bank.
The analytical project will be conducted in teams of 1-2 students. Each team will analyze specific real-world data. The project will be graded based on a report presenting the results and on the weekly interaction with the instruction team, in either written or oral format. Details of the project can be found here
The final grade for this course is composed of the following:
1. Homework (individual work, 3 times 10%) 30%
2. Weekly contribution to Quiz Bank (individual work) 15%
3. Analytical project (in teams)
–weekly progress (15%)
–final deliverables (40% = project report 30% + caselet write up 10%)
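As a rough illustration of how the weights above combine, the following Python sketch computes a weighted final score. It is not an official grading tool: the function and variable names are hypothetical, and it assumes every component is scored on a 0-100 scale and that the three homework assignments count equally, as the breakdown suggests.

```python
# Weights copied from the grade breakdown above, as fractions of the final grade.
# All names here are hypothetical and for illustration only.
WEIGHTS = {
    "homework": 0.30,         # three assignments at 10% each
    "quiz_bank": 0.15,        # weekly contribution to the Quiz Bank
    "weekly_progress": 0.15,  # weekly project progress
    "project_report": 0.30,   # final project report
    "caselet": 0.10,          # caselet write-up
}


def final_grade(homework_scores, quiz_bank, weekly_progress, project_report, caselet):
    """Return the weighted final score, assuming each input is on a 0-100 scale."""
    if len(homework_scores) != 3:
        raise ValueError("expected exactly three homework scores")
    components = {
        # Equal weighting of the three assignments is the same as averaging them.
        "homework": sum(homework_scores) / len(homework_scores),
        "quiz_bank": quiz_bank,
        "weekly_progress": weekly_progress,
        "project_report": project_report,
        "caselet": caselet,
    }
    return sum(WEIGHTS[name] * score for name, score in components.items())


if __name__ == "__main__":
    # Example with made-up scores.
    print(final_grade([90, 80, 70], quiz_bank=100, weekly_progress=80,
                      project_report=85, caselet=90))
```

The weights sum to 100%, so a student who scores 100 on every component receives a final score of 100.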
Students are expected to strictly follow Carnegie Mellon University rules of academic integrity in this course. This means homework is to be the work of the individual student using only permitted material and without any cooperation from other students or third parties. It also means that usage of work by others is only permitted in the form of quotations, and any such quotation must be distinctively marked to enable identification of the student’s own work and own ideas. All external sources used must be properly cited, including author name(s), publication title, year of publication, and a complete reference needed for retrieval. For the group projects, the work should be the work of only the group members. In all their work students should not in any way rely on solutions to problems distributed in prior years or on the work of prior students or other current students. Violations will be penalized to the full extent mandated by the CMU policies. There will be no exceptions.
Do your best to maintain a healthy lifestyle this semester by eating well, exercising, avoiding drugs and alcohol, getting enough sleep and taking some time to relax. This will help you achieve your goals and cope with stress. All of us benefit from support during times of struggle. You are not alone. There are many helpful resources available on campus and an important part of the college experience is learning how to ask for help. Asking for support sooner rather than later is often helpful. If you or anyone you know experiences any academic stress, difficult life events, or feelings like anxiety or depression, we strongly encourage you to seek support. Counseling and Psychological Services (CaPS) is here to help: call 412-268-2922 and visit their website at http://www.cmu.edu/counseling/. Consider reaching out to a friend, faculty or family member you trust for help getting connected to the support that can help.