Course Coordinates

Taught at the University of Konstanz by Hansjörg Neth (h.neth@uni.kn, SPDS, office D507).
Summer 2018: Mondays, 13:30–15:00, C511.
Links to ZeUS and Ilias

Course Description

Overview

Students of psychology and other social sciences are trained to analyze data. But the data they learn to work with (e.g., in courses on statistics and empirical research methods) is typically provided to them and structured in a (rectangular or “tidy”) format that presupposes many steps of data processing regarding the aggregation and spatial layout of variables. When beginning to collect data from other sources, most students struggle with these pre-processing steps, which (even for experienced data scientists) tend to require more time and effort than choosing and conducting statistical tests.

This course develops the foundations of data analysis that allow students to collect data from real-world sources and transform and shape such data to conduct reproducible research and answer scientific and practical questions. Although there are many good introductions to data science (we use Wickham & Grolemund, 2017), they typically do not take into account the special needs – and often anxieties and reservations – of psychology students. As social scientists are not computer scientists, we introduce new concepts and commands without assuming a mathematical or computational background. Adopting a task-oriented perspective, we begin with a specific problem and then solve it with some combination of data collection, manipulation, modeling, and visualization.

Goals

Our main goal is to develop a set of useful skills in analyzing real-world data and conducting reproducible research. Upon completing this course, you will be able to read, transform, analyze, and visualize data of various types. Many interactive exercises will allow you to check your understanding, monitor your progress, and practice your skills.

Requirements

This course assumes some basic familiarity with statistics and the R programming language, but enthusiastic programming novices are also welcome.

Attending class, continuous preparation (by working through chapters prior to class), and solving exercises (in class).

Assessment

Weekly exercises in class, mid-term test, and/or final test or project (which may be thesis-related).

On June 4, 2018, we will have a mid-term exam, consisting of multiple-choice (MC) questions and/or practical exercises (open book, 90 mins).

Final assessment: Vote on 1 of 2 options (after mid-term exam):

Final exam (same format as in mid-term test);
Final data science project (contents to be discussed with instructor).

Both components count towards your final grade (with the better component being systematically over-weighted).

Schedule

Note: The following schedule was revised on June 20th, 2018.

Individual sessions, dates, and topics

16.04.2018: Introduction: Chapters 1, 2, 4, 6, 8

Part 1: Explore data

23.04.2018: Chapter 3: Data visualisation: ggplot2.
30.04.2018: Chapter 5: Data transformation: dplyr.
07.05.2018: Chapter 7: Exploratory data analysis (EDA).

Part 2: Wrangle data

14.05.2018: Chapter 10: Tibbles: tibble.
21.05.2018: Holiday: No class. Skipping Chapter 11: Data import: readr.
28.05.2018: Chapter 12: Tidy data: tidyr.
04.06.2018: Mid-term exam: Data transformation and visualization (open book, 90 mins).
11.06.2018: Chapter 13: Relational data: dplyr.

Part 3: Rinse and repeat

18.06.2018: Essentials of Chapter 10: Tibbles: tibble.
25.06.2018: Essentials of Chapter 5: Data transformation: dplyr.
02.07.2018: Essentials of Chapter 3: Data visualisation and Chapter 7: EDA: ggplot2 + dplyr.
09.07.2018: Essentials of Chapter 12: Tidy data: tidyr.
16.07.2018: Final exam: Transforming and visualizing data (open book, take-home).

All ds4psy essentials:

Nr.	Topic
1.	Creating and using tibbles
2.	Data transformation
3.	Visualizing data
4.	Tidy data
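
The four essentials above can be combined in one small workflow. The following is only a minimal sketch, not course material: the reaction-time values and all object names (rt_wide, rt_long, rt_summary) are invented for illustration, and it assumes that the packages tibble, tidyr (version 1.0.0 or later, for pivot_longer()), dplyr, and ggplot2 are installed. The textbook edition used in 2018 reshapes data with gather() and spread() instead.

```r
library(tibble)
library(tidyr)
library(dplyr)
library(ggplot2)

# 1. Creating and using tibbles:
#    hypothetical reaction times (in ms) of 4 participants in 2 conditions
rt_wide <- tibble(
  id          = 1:4,
  condition_a = c(512, 480, 530, 495),
  condition_b = c(560, 525, 610, 540)
)

# 4. Tidy data: reshape from wide to long format,
#    so that each row holds exactly one observation
rt_long <- rt_wide %>%
  pivot_longer(
    cols         = starts_with("condition_"),
    names_to     = "condition",
    names_prefix = "condition_",
    values_to    = "rt"
  )

# 2. Data transformation: mean reaction time and count per condition
rt_summary <- rt_long %>%
  group_by(condition) %>%
  summarise(mean_rt = mean(rt), n = n())

print(rt_summary)

# 3. Visualizing data: raw values on top of a boxplot per condition
p <- ggplot(rt_long, aes(x = condition, y = rt)) +
  geom_boxplot() +
  geom_jitter(width = 0.1) +
  labs(x = "Condition", y = "Reaction time (ms)")

print(p)
```

Note that the order of steps in practice (create, tidy, transform, visualize) differs from the order in which the essentials are numbered.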

Structure of each week and session

This course mostly uses the so-called flipped classroom paradigm, in which students solve exercises in pairs or small groups, with additional guidance from the instructor (see Wikipedia for details).

Before class: You need to prepare every session at home by
1. reading the current chapter, and
2. preparing an .R or .Rmd script that includes all chapter code (without the exercises).
During class: Every class session consists of 2 parts:
1. Asking questions (plenary): To start each weekly session, the instructor introduces the current topic and answers questions on main concepts or commands.
2. Solving chapter exercises (in random dyads/triples): Copy exercise text into your scripts and solve exercises in a way that you can share with other members of the class (by posting your solutions on Ilias).

Details

This section contains some clarifications (on the term “data science”) and recommendations (on how to succeed in this course).

What is data science?

Data science as the art of dealing with data

In our technology-driven environment, data is cheap and ubiquitous. Every call, trip, or visit to a web page leaves a trail of data. Apart from raising important issues of privacy, the notion of big data highlights another problem: In an era in which digital information is collected everywhere, we are drowning in data that promises to be meaningful and valuable, but that we often fail to understand. We typically lack the ability and the tools for dealing with data.

Thus, data science is relevant not only for natural and computer sciences, but also for social sciences, humanities, and even arts. It is partly a scientific enterprise, but also a skill, an art, and a craft. Like all skills, arts and crafts, data science requires special tools, experience in choosing and using them, and an awful lot of dedicated practice (see below).

Data science vs. statistics

There is considerable overlap between data science and statistics, but they are not identical.

Statistics is primarily a mathematical discipline that examines the properties of probability distributions and inferences from samples. Statistics typically involves formulas and numbers, but does not necessarily require getting your hands dirty with real-world data. In the context of psychology, statistics mostly quantifies differences and relationships between groups and tests effects of experimental manipulations or treatments.
Data science typically begins with messy data from real-world sources. Data literacy (as a basic ability to deal with data) is an essential prerequisite for applying statistics, but does not require statistics to yield meaningful results. Data science can be described as using statistics to solve real-world problems, but pursues a different main goal: understanding the data, rather than testing effects.

In short, statistics mostly summarizes data to test hypotheses, while data science transforms and visualizes data to promote the generation of hypotheses.¹ In science, both objectives are ubiquitous and complementary, but basic data literacy and data science enable us to understand and deal with data before and beyond statistics. Dealing with data in a variety of ways enables new insights (e.g., by visualizing properties and relationships) and allows us to think more clearly about the causes and implications of data.
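
The difference between summarizing and visualizing data can be illustrated with a small base R sketch. It uses Anscombe's quartet, a classic example that is not part of the course text and is chosen here only as an assumption of what a suitable illustration might look like. The anscombe data frame ships with R (in the datasets package); the object names sets and summaries are made up for this example.

```r
# anscombe contains four small data sets: columns x1..x4 and y1..y4
sets <- 1:4

# Summarizing: means and correlations are almost identical across the sets
summaries <- data.frame(
  set    = sets,
  mean_x = sapply(sets, function(i) mean(anscombe[[paste0("x", i)]])),
  mean_y = sapply(sets, function(i) mean(anscombe[[paste0("y", i)]])),
  r      = sapply(sets, function(i) {
    cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
  })
)
print(round(summaries, 2))

# Visualizing: scatterplots reveal four very different relationships
old_par <- par(mfrow = c(2, 2))  # 2 x 2 grid of plots
for (i in sets) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i),
       main = paste("Set", i))
}
par(old_par)  # restore previous graphics settings
```

Relying on the summary table alone would suggest that the four data sets are interchangeable, whereas the plots show a linear trend, a curve, and two sets dominated by single outlying points.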

Premises of this Course

Assumptions

This course assumes that you are motivated to learn its material and willing and able to invest the effort and time required to do so. Theoretically, you could work through the textbook by yourself, but this would be an arduous and lonely endeavour. Thus, the instructor and other classmates are there to help you stay focused, provide motivation and guidance, ask and answer questions, and allow you to continuously check your progress and understanding.

Succeeding in this course

The main ingredient for succeeding in this course is aptly summarized by the following quote:

Practise yourself, for Heaven’s sake, in little things,
and thence proceed to greater.
Epictetus, Discourses, Book I, 18

Getting good at data science is similar to learning an instrument or mastering some sport: You need some things (e.g., special equipment, various accessories, but also a suitable infrastructure) to get started. Once you have all this stuff, you need advice and instruction from experts and lots of practice. It always helps to have talent, but dedicated practice is essential even when you happen to be a genius.

Becoming proficient in data science requires both curiosity and routine:

Curiosity implies interest, motivation, and fun: If you really want to find out or achieve something (e.g., understand some data or conduct some analysis), you are willing to find out how this can be achieved and will overcome the obstacles that may appear along the way. Perceiving tasks as a challenge rather than a chore will allow you to enjoy the efforts invested, rather than suffering from them.
Routine implies discipline, stamina, and lots of practice. It’s impossible to acquire new skills without investing a lot of time and effort. Crucially, habitual practice (e.g., daily use of data science tools) helps to develop various organizational skills (e.g., using keyboard shortcuts, naming objects, formatting code, and structuring files or projects) that are non-trivial and will profoundly affect your productivity.

Ultimately, practicing any art, craft or science is also a process of socialization: belonging to a community that shares the same goals, methods, and values.

Warning

Beware of side effects: Becoming a useR will profoundly transform your thinking — not just about code, but about the data and problems you are analyzing or visualizing.

Programming (in R or any other language) can be enjoyable, but also has addictive potential. Make sure to take regular breaks and keep focused on the questions behind and beyond the data science tools.

Links

Essential commands and examples