Author: João Neto
LinkedIn: linkedin.com/in/joaonetoprofile

This text introduces ethical issues in data science, covering privacy (including GDPR), fairness, and accountability. It also touches on bias in data.

Keywords: personal data, processing, privacy, ethics, bias, fairness, accountability, GDPR, lawfulness

1. Introduction

The rapid growth in data volume, together with technological advancement, has brought new challenges for both Data Scientists and organizations, particularly regarding legal and ethical considerations. These considerations span data collection, privacy, and model deployment; they complicate data-driven decision-making and underline the need for algorithmic transparency. Although progress has been made, challenges persist and call for standardized ethical guidelines.

2. Data Ethics & Privacy (GDPR)

Because data is often tied to people, there is growing awareness of the ethical implications of processing it. Consequently, strict adherence to a global ethical code, coupled with compliance with regulations such as the General Data Protection Regulation (GDPR), is imperative to protect the rights and freedoms of each person. Non-compliance with these ethical principles and regulations carries legal consequences and can cause reputational damage for both individuals and organizations.

Specifically, the 4th, 5th, 6th, and 9th articles of GDPR are especially important for the daily work of a data scientist, guiding decisions on the collection and storage of personal data. The 4th Article defines personal data and processing, the 5th Article outlines seven principles regulating personal data processing, the 6th Article specifies the conditions for lawful processing, and the 9th Article prohibits the processing of special categories of personal data except in specific situations. These GDPR articles are detailed below.

2.1. GDPR Articles

4th Article: Definition of Personal Data and Processing

  • Personal data is any information relating to an identified or identifiable natural person (the “data subject”), meaning a person who can be identified “directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person” (source: https://www.legislation.gov.uk/eur/2016/679/article/4, paragraph 1).

  • Processing “means any operation or set of operations which is performed on personal data or on sets of personal data, whether or not by automated means, such as collection, recording, organisation, structuring, storage, adaptation or alteration, retrieval, consultation, use, disclosure by transmission, dissemination or otherwise making available, alignment or combination, restriction, erasure or destruction” (source: https://www.legislation.gov.uk/eur/2016/679/article/4, paragraph 2).

5th Article: Principles to Process Personal Data

This article outlines the seven principles that regulate the processing of personal data. Under them, personal data shall be (source: http://www.legislation.gov.uk/eur/2016/679/article/5/2016-04-27):
  1. Processed in a transparent, lawful, and fair manner;
  2. Limited to its purpose (purpose limitation);
  3. Minimal, adequate, and necessary to the purpose defined (data minimisation);
  4. Kept accurate and up to date (accuracy);
  5. Not kept longer than needed (storage limitation);
  6. Processed so that its integrity and confidentiality are ensured; and
  7. Handled under the controller's accountability: the controller is responsible for, and must be able to demonstrate, compliance with the principles above.
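
The data minimisation and retention principles above can be illustrated with a minimal sketch. It assumes a hypothetical churn analysis that needs only three fields and a hypothetical one-year retention period; the names `FIELDS_NEEDED`, `RETENTION`, and `minimise`, and the sample records, are illustrative and not part of GDPR or of any library.

```python
from datetime import date, timedelta

# Hypothetical purpose: a churn analysis that needs only these fields.
FIELDS_NEEDED = {"customer_id", "plan", "last_login"}
# Hypothetical retention period; the real value depends on the documented purpose.
RETENTION = timedelta(days=365)


def minimise(records, today):
    """Keep only the needed fields and drop records past the retention period."""
    kept = []
    for record in records:
        if today - record["collected_on"] > RETENTION:
            continue  # not kept longer than needed
        kept.append({k: v for k, v in record.items() if k in FIELDS_NEEDED})
    return kept


# Made-up records for illustration only.
records = [
    {"customer_id": 1, "plan": "basic", "last_login": date(2024, 5, 1),
     "home_address": "(not needed)", "collected_on": date(2024, 1, 10)},
    {"customer_id": 2, "plan": "pro", "last_login": date(2022, 3, 2),
     "home_address": "(not needed)", "collected_on": date(2022, 1, 5)},
]
result = minimise(records, today=date(2024, 6, 1))
print(result)
```

In this example the second record is dropped because it is older than the assumed retention period, and the address field is removed from the first because the assumed purpose does not need it.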

6th Article: Lawfulness of Processing

Processing shall be lawful only if at least one of the following applies (source: https://www.legislation.gov.uk/eur/2016/679/article/6, paragraph 1):
  1. the data subject has given consent to the processing of his or her personal data for one or more specific purposes;
  2. processing is necessary for the performance of a contract to which the data subject is party;
  3. processing is necessary for compliance with a legal obligation to which the controller is subject;
  4. processing is necessary to protect the vital interests of the data subject or of another natural person;
  5. processing is necessary for the performance of a task carried out in the public interest or in the exercise of official authority vested in the controller;
  6. processing is necessary for the legitimate interests pursued by the controller or by a third party, except where such interests are overridden by the interests or fundamental rights and freedoms of the data subject which require protection of personal data, in particular where the data subject is a child.

9th Article: Processing of Special Categories of Personal Data

This article prohibits the “processing of personal data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, or trade union membership, and the processing of genetic data, biometric data for the purpose of uniquely identifying a natural person, data concerning health or data concerning a natural person’s sex life or sexual orientation” (source: https://www.legislation.gov.uk/eur/2016/679/article/9). However, the prohibition does not apply in the situations listed in paragraph 2 of the same article, for example when the data subject has given explicit consent (GDPR, Article 9, paragraph 2).
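
One way to keep the 6th and 9th Articles in view during a project is to record, for each processing activity, its lawful basis and any special categories it touches. The sketch below is a hypothetical illustration, not a compliance tool: the class `ProcessingActivity`, the function `review`, and the column tags in `SPECIAL_CATEGORIES` are assumptions introduced here.

```python
from dataclasses import dataclass, field
from enum import Enum


class LawfulBasis(Enum):
    """The six conditions of Article 6, paragraph 1."""
    CONSENT = "consent"
    CONTRACT = "contract"
    LEGAL_OBLIGATION = "legal obligation"
    VITAL_INTERESTS = "vital interests"
    PUBLIC_TASK = "public task"
    LEGITIMATE_INTERESTS = "legitimate interests"


# Hypothetical column tags for the special categories named in Article 9.
SPECIAL_CATEGORIES = {
    "ethnic_origin", "political_opinion", "religion", "union_membership",
    "genetic", "biometric", "health", "sexual_orientation",
}


@dataclass
class ProcessingActivity:
    name: str
    columns: set
    lawful_bases: set = field(default_factory=set)
    article9_exception: str = ""  # note on which Article 9(2) situation applies


def review(activity):
    """Return a list of issues that need attention before processing starts."""
    issues = []
    if not activity.lawful_bases:
        issues.append("no lawful basis recorded (Article 6)")
    special = sorted(activity.columns & SPECIAL_CATEGORIES)
    if special and not activity.article9_exception:
        issues.append(
            f"special categories without a documented exception (Article 9): {special}"
        )
    return issues


activity = ProcessingActivity(
    name="marketing model",
    columns={"age", "health", "postcode"},
    lawful_bases={LawfulBasis.CONSENT},
)
issues = review(activity)
print(issues)
```

Here the activity has a lawful basis recorded but includes a health column with no documented exception, so `review` reports one issue.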

3. Fairness

Fairness is critical to ensuring just and equitable outcomes, particularly when data-driven decisions affect individuals or communities. It is therefore important to treat all individuals and groups impartially, avoid discrimination, and ensure that benefits and risks are distributed fairly among the groups under study. The principles of fairness encompass:
  • Equality: Treating all individuals without discrimination and ensuring equal treatment;
  • Equity: Recognizing that groups start from different circumstances and adjusting for them so that outcomes are fair;
  • Transparency: Providing a clear understanding of how the systems make decisions; and
  • Inclusivity: Considering diverse perspectives and ensuring representation in the data.
In practice, achieving fairness faces challenges such as bias in data (a model trained on biased data tends to reproduce those biases), model complexity (complex AI models can make outcomes difficult to interpret), and context: fairness is context-dependent and varies across applications and user groups.

Several common metrics can be used to assess fairness, although the right choice depends on the context and goals, and ongoing monitoring is crucial to detect and rectify biases that could emerge over time. These metrics include:
  • Demographic parity: Ensuring that the rate of positive outcomes is the same across groups;
  • Equalized odds: Ensuring that true positive and false positive rates are the same across groups; and
  • Disparate impact: Comparing the rate of positive outcomes between groups to identify and address discriminatory effects on specific groups.
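
The three metrics above can be computed from predictions, true labels, and group membership. The following is a minimal sketch using only the standard library; the helper names `rate` and `group_rates` and the toy data are made up for illustration, and the exact definitions used in practice (difference versus ratio, which group is the reference) vary between tools.

```python
def rate(values):
    """Share of 1s in a list of 0/1 values; 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


def group_rates(y_true, y_pred, groups, group):
    """Selection rate, true positive rate, and false positive rate for one group."""
    rows = [(t, p) for t, p, g in zip(y_true, y_pred, groups) if g == group]
    selection = rate([p for _, p in rows])
    tpr = rate([p for t, p in rows if t == 1])
    fpr = rate([p for t, p in rows if t == 0])
    return selection, tpr, fpr


# Toy data, made up for illustration: 1 is the positive outcome.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

sel_a, tpr_a, fpr_a = group_rates(y_true, y_pred, groups, "a")
sel_b, tpr_b, fpr_b = group_rates(y_true, y_pred, groups, "b")

# Demographic parity: gap between the groups' selection rates (0 means parity).
dp_diff = abs(sel_a - sel_b)
# Equalized odds: largest gap in true positive or false positive rate.
eo_gap = max(abs(tpr_a - tpr_b), abs(fpr_a - fpr_b))
# Disparate impact: ratio of the lower selection rate to the higher one.
di_ratio = min(sel_a, sel_b) / max(sel_a, sel_b)

print(f"demographic parity difference: {dp_diff:.2f}")
print(f"equalized odds gap: {eo_gap:.2f}")
print(f"disparate impact ratio: {di_ratio:.2f}")
```

On this toy data group "a" is selected 75% of the time and group "b" 25% of the time, so all three metrics signal a disparity. What counts as an acceptable value depends on the context, as noted above.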

3.1. Data Bias as a Fairness challenge

Data bias significantly impacts decision-making in Data Science. Data is considered biased when the sample in a study/model does not represent the population of the phenomenon of interest.
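
A simple first check for this kind of bias is to compare each group's share in the sample with its share in the population. The sketch below assumes the population shares are known from an external source; the function name `representation_gaps` and all numbers are made up for illustration.

```python
from collections import Counter


def representation_gaps(sample_groups, population_shares):
    """Sample share minus population share for each group."""
    counts = Counter(sample_groups)
    total = len(sample_groups)
    return {
        group: counts[group] / total - share
        for group, share in population_shares.items()
    }


# Made-up numbers for illustration only.
population_shares = {"urban": 0.5, "rural": 0.5}
sample_groups = ["urban"] * 8 + ["rural"] * 2

gaps = representation_gaps(sample_groups, population_shares)
for group, gap in gaps.items():
    print(f"{group}: {gap:+.2f}")
```

A positive gap means the group is over-represented in the sample and a negative gap means it is under-represented. In this example the urban group is over-represented by about 30 percentage points.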

Addressing bias is challenging and has no straightforward solution. Simply removing potentially biased variables may not be feasible: they can also carry predictive value, and removing all of them could leave no variables to work with. Additionally, some variables can serve as "proxies" for sensitive characteristics. For example, if ethnicity or religion is removed from a model but names remain, names can inadvertently act as proxies for ethnicity because of their cultural associations.
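
One rough way to spot a proxy is to ask how well a feature alone predicts the sensitive attribute, compared with always guessing the most common sensitive value. The sketch below is one possible heuristic, not an established test; the function name `proxy_strength` and the placeholder data are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def proxy_strength(feature, sensitive):
    """Return (accuracy, baseline).

    accuracy: share of rows where the sensitive value equals the most common
    sensitive value among rows with the same feature value.
    baseline: share of rows holding the overall most common sensitive value.
    """
    by_value = defaultdict(Counter)
    for f, s in zip(feature, sensitive):
        by_value[f][s] += 1
    correct = sum(c.most_common(1)[0][1] for c in by_value.values())
    baseline = Counter(sensitive).most_common(1)[0][1]
    n = len(sensitive)
    return correct / n, baseline / n


# Placeholder values, made up for illustration only.
feature = ["name_a", "name_a", "name_a", "name_a",
           "name_b", "name_b", "name_b", "name_b"]
sensitive = ["g1", "g1", "g1", "g2", "g2", "g2", "g2", "g1"]

accuracy, baseline = proxy_strength(feature, sensitive)
print(f"accuracy from feature alone: {accuracy:.2f}, baseline: {baseline:.2f}")
```

Here the feature recovers the sensitive attribute for 75% of rows against a 50% baseline, which suggests it carries information about the attribute. With features that have many distinct values, such as real names, this in-sample figure is inflated, so the comparison is better made on held-out data.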

Efforts to mitigate bias include adversarial training and fairness-aware machine learning techniques.
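
Adversarial training needs a full model and training loop, so the sketch below shows a simpler fairness-aware preprocessing step instead, commonly called reweighing. Each (group, label) combination receives a weight equal to its expected frequency if group and label were independent, divided by its observed frequency. The function names and the toy data are assumptions made for illustration.

```python
from collections import Counter


def reweighing_weights(groups, labels):
    """Weight per (group, label): expected frequency under independence / observed."""
    n = len(labels)
    group_counts = Counter(groups)
    label_counts = Counter(labels)
    pair_counts = Counter(zip(groups, labels))
    return {
        (g, y): (group_counts[g] * label_counts[y]) / (n * count)
        for (g, y), count in pair_counts.items()
    }


def weighted_positive_rate(groups, labels, weights, group):
    """Weighted share of positive labels within one group."""
    pairs = [(g, y) for g, y in zip(groups, labels) if g == group]
    total = sum(weights[pair] for pair in pairs)
    positive = sum(weights[pair] for pair in pairs if pair[1] == 1)
    return positive / total


# Toy data, made up for illustration: group "a" has more positive labels.
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
labels = [1, 1, 1, 0, 1, 0, 0, 0]

weights = reweighing_weights(groups, labels)
rate_a = weighted_positive_rate(groups, labels, weights, "a")
rate_b = weighted_positive_rate(groups, labels, weights, "b")
print(weights)
print(f"weighted positive rate: a={rate_a:.2f}, b={rate_b:.2f}")
```

Before weighting, the positive rate is 75% for group "a" and 25% for group "b"; after weighting, both are 50%. The weights can then be passed as sample weights to a learning algorithm that supports them.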

4. Accountability

Accountability in Data Science projects spans the development, deployment, and ongoing management phases. This responsibility is shared among Data Scientists, organizations, regulators, and policymakers: Data Scientists ensure ethical practices, organizations cultivate a responsible culture, and regulators and policymakers establish legal frameworks. Together, these efforts aim to align projects with ethical standards and with expectations of fair decision-making.