10/10/2015

The Role of Statistics in Data Science

  • Foundations of the skill set
    • Statistics
    • Linear Algebra
    • Programming
  • To be used for
    • Data Preparation and Munging
    • Modeling
    • Coding
    • Visualization
    • Communication

Statistical Inference

  • There are many data-generating processes in our lives
  • Need to describe, understand, and make sense of these processes
  • This is part of the solution to problems we need to solve

Statistical Inference

  • Data is being produced all the time
  • Different types of data
  • With different purposes
  • There are data generating processes in our lives

Statistical Inference

  • We analyze data to:
    • Understand the data generating processes
    • Describe the data
    • Make sense of those processes
  • This may be part of the solution to a problem

Statistical Inference

  • If data is available
  • How can we collect it?
    • Sampling method

Statistical Inference - Sampling

Statistical Inference - Uncertainty

  • There are two types of uncertainty
    • That involved with the process
    • That related to the data collection method

Statistical Inference - Model

  • Data itself represents the world
  • Need a way to simplify this data
    • More comprehensible
    • More concise
    • Mathematical models or functions of the data: Statistical estimators
  • This process (from the world to data and vice versa) is the field of statistical inference

Statistical Inference - Defined

  • "Statistical Inference is the discipline that concerns itself with the development of procedures, methods, and theorems that allow us to extract meaning and information from data that has been generated by stochastic (random) processes."

  • The process of generating conclusions about a population from a noisy sample (Brian Caffo, 2015)
    • We generate new knowledge
    • We use probability models to connect our data and a population and to perform inference

Statistical Inference - Example

  • What are the most popular activities in the student community, given a poll of 100 students?
  • Who will win the election?
    • Poll the entire population?
    • Someone could change his mind and actually vote differently
    • How do we collect data?
    • How can we quantify the uncertainty in the process to produce a good guess about the winner?
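As a minimal sketch of how the uncertainty in a poll might be quantified, the snippet below computes a normal-approximation confidence interval for a candidate's share of the vote. The poll numbers (520 of 1,000 respondents) and the function name `proportion_confidence_interval` are hypothetical, and the calculation assumes a simple random sample of voters.

```python
import math
from statistics import NormalDist


def proportion_confidence_interval(successes, n, confidence=0.95):
    """Normal-approximation confidence interval for a population proportion."""
    p_hat = successes / n                        # sample proportion
    std_error = math.sqrt(p_hat * (1 - p_hat) / n)
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return p_hat - z * std_error, p_hat + z * std_error


# Hypothetical poll: 520 of 1,000 sampled voters favor candidate A.
low, high = proportion_confidence_interval(520, 1000)
print(f"95% interval for candidate A's share: ({low:.3f}, {high:.3f})")
```

With these made-up numbers the interval contains 0.5, so this poll alone would not settle who wins.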

Goals of Inference

  • Some of the goals of inference are (Caffo 2015)
    • Estimate and quantify uncertainty of an estimate of a population quantity
    • Determine whether a population quantity is a benchmark value ("is the treatment effective?")
    • Infer a mechanistic relationship when quantities are measured with noise ("What is the slope for Hooke's law?")
    • Determine the impact of a policy ("If we reduce pollution levels, will asthma rates decline?")
    • Talk about the probability that something occurs

Tools for Statistical Inference

  1. Randomization
  2. Random sampling
  3. Sampling models
  4. Hypothesis testing
  5. Confidence intervals
  6. Probability models
  7. Study design
  8. Nonparametric bootstrapping
  9. Permutation, randomization and exchangeability testing

Populations and Samples

  • We collect data from populations
  • Data we collect is our sample
  • Observations in our population: N
  • Observations in our sample: n
  • We take a sample to examine the observations in order to obtain conclusions and make inferences about the population
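A minimal sketch of the population/sample distinction, assuming a hypothetical population of N = 10,000 simulated measurements and a simple random sample of n = 100. The values, the seed, and the variable names are illustrative only.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population of N = 10,000 values (for example, heights in cm).
N = 10_000
population = [random.gauss(170, 10) for _ in range(N)]

# Simple random sample of n = 100 observations, drawn without replacement.
n = 100
sample = random.sample(population, n)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"N = {len(population)}, population mean = {population_mean:.2f}")
print(f"n = {len(sample)}, sample mean = {sample_mean:.2f}")
```

The sample mean is used as an estimate of the population mean, which in practice we usually cannot compute directly.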

Our Sample

  • Different ways to obtain our sample

Our Sample

  • Important to know how the sample was obtained
    • Could have introduced biases to the data
    • Does not represent the population
    • Conclusions could be wrong
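The effect of a biased sampling method can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population is a hypothetical, simulated set of incomes, and the "biased" sample is assumed to reach only units above the population median.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 right-skewed incomes (arbitrary units).
population = [random.lognormvariate(10, 0.5) for _ in range(10_000)]

# Simple random sample: every unit has the same chance of being selected.
random_sample = random.sample(population, 500)

# Biased sample: only units above the population median can be reached.
median_income = statistics.median(population)
reachable = [x for x in population if x > median_income]
biased_sample = random.sample(reachable, 500)

print(f"population mean:    {statistics.mean(population):.0f}")
print(f"random sample mean: {statistics.mean(random_sample):.0f}")
print(f"biased sample mean: {statistics.mean(biased_sample):.0f}")
```

The biased sample overestimates the population mean no matter how many observations it contains, because the collection method itself excludes part of the population.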

Populations and Samples of Big Data

  • What about samples when we are able to record "ALL" of the information all the "TIME"?
  • If we are able to process "ALL" the data, why should we work with samples?

  • How much data we need actually depends on the problem at hand
    • Data Analysis (sample)
    • Inference (sample)
    • Tune a specific user system (all the specific user's data)

Biases

  • Depending on the data we use, there are biases
    • Even if we have the ENTIRE data corpus of Facebook, Google, or Twitter, we cannot draw conclusions about ALL people in general, only about the users of Facebook, Google, or Twitter, and only for the dates that the data covers

Biases

  • See the talk by Kate Crawford - Hidden Biases of Big Data
    • Bias - need to understand/be-aware of the context of data
    • Signal - what is represented/misrepresented/excluded in our data
    • Scale - are we losing detail when we aggregate data?

Biases

  • Take tweets about Hurricane Sandy as an example
    • Judging from tweets alone, Sandy would not look that bad
      • Most people suffering from the hurricane could not tweet
    • You might conclude that this is what Sandy looked like for Twitter users
      • Not a representative sample of the general US population

Samples in Context

  • Our domain data could be seen as a whole dataset or as a sample
  • Bone marrow - leukemia samples from Hospital A
    • The whole domain for a researcher
    • A sample for hospitals in the same area
    • What about the country?
    • What about a world scale?
  • Sampling Distribution

Samples for Different Types of Data

  • Need to feel comfortable with different types of data
    • Traditional
    • Text
    • Records (user-level, events, json)
    • Geo-based location data (epidemiology)
    • Network
    • Sensor
    • Images
    • Hybrid
  • How should we account for structural relations in the data when working with structured domains, such as geographic or network domains?

Also to Consider

  • In big data N = ALL
    • Does ALL really represent an entire population?
    • Remember biases!
  • We should take into account causation in our models
  • Data is not always objective

What Is a Model?

  • (Merriam-Webster)
    • A system of postulates, data, and inferences presented as a mathematical description of an entity or state of affairs; also: a computer simulation based on such a system

What Is a Model?

  • (textbook: Rachel Schutt & Cathy O'Neil)
    • A model is our attempt to understand and represent the nature of reality through a particular lens, be it architectural, biological, or mathematical.
    • A model is an artificial construction where all extraneous detail has been removed or abstracted. Attention must always be paid to these abstracted details after a model has been analyzed to see what might have been overlooked.
  • We need an abstraction of what we see to create our model

Creating a Model

  • To create our model we start by
    • Understanding our process
      • Order the steps of our process
      • Find what influences other steps
      • Find what causes what
      • Identify how we could test the model
    • Express the relationships with math
    • Draw diagrams (data flow, causality,…)
      • Then choose equations to represent these relations

Statistical Model

Wikipedia

  • A statistical model embodies a set of assumptions concerning the generation of the observed data, and similar data from a larger population. A model represents, often in considerably idealized form, the data-generating process. The model assumptions describe a set of probability distributions, some of which are assumed to adequately approximate the distribution from which a particular data set is sampled.

Statistical Model

  • A model is usually specified by mathematical equations that relate one or more random variables and possibly other non-random variables. As such, "a model is a formal representation of a theory" (Herman Adèr quoting Kenneth Bollen).

  • All statistical hypothesis tests and all statistical estimators are derived from statistical models. More generally, statistical models are part of the foundation of statistical inference.

How Do We Build a Model?

  • How do we know which functional form the data should take?
    • We really don't know
    • It's part art and part science
  • There is no recipe for this

How Do We Build a Model?

  • Where can we start?
    • We can perform Exploratory Data Analysis (EDA)
      • Explore our dataset
      • Make plots: histograms, scatterplots, regression lines (linear regression)
      • Start with simple analyses and then make them more complex
      • Think about what you are getting from the plots: Does it make sense? What else can you do?
    • If we make assumptions, document them all

How Do We Build a Model?

  • Sometimes simple models are accurate enough
    • More complex models might not significantly increase accuracy

Probability

  • Probability distributions are the foundation of statistical models
    • e.g. linear regression
    • Probability: the law that assigns numbers to the long-run occurrence of random phenomena after repeated unrelated realizations
      • e.g. the proportion of heads when we flip a coin many times
    • Randomness: a process occurring without apparent deterministic patterns. In practice, we sometimes treat variables as random even when they are completely deterministic
  • Need to know basic probability

Kolmogorov's Three Rules

  • Given an experiment with a random outcome
    • The probability of an event is a non-negative real number between 0 and 1
    • The probability that some elementary event in the entire sample space will occur is 1 (there are no elementary events outside the sample space)
    • The probability of the union of any two sets of outcomes that have nothing in common (mutually exclusive) is the sum of their respective probabilities

Kolmogorov's Three Rules

  • Note: two events are mutually exclusive if they cannot occur simultaneously (e.g. a single roll of a die cannot show two different values at the same time)

Consequences of The Three Rules

  • The probability that nothing occurs is 0
  • The probability that something occurs is 1
  • The probability of something is 1 minus the probability that the opposite occurs
  • The probability of at least one of two (or more) things that cannot simultaneously occur (mutually exclusive) is the sum of their respective probabilities
  • For any two events the probability that at least one occurs is the sum of their probabilities minus the probability of their intersection
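These consequences can be checked on a small example. The sketch below assumes one roll of a fair six-sided die with equally likely outcomes. The helper `prob` and the event names are hypothetical, and exact fractions are used to avoid rounding issues.

```python
from fractions import Fraction

# Sample space for one roll of a fair six-sided die.
sample_space = frozenset(range(1, 7))


def prob(event):
    """Probability of an event (a set of outcomes) when all outcomes are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))


odd = {1, 3, 5}
even = {2, 4, 6}
at_most_two = {1, 2}

# The probability that nothing occurs is 0; that something occurs is 1.
assert prob(set()) == 0
assert prob(sample_space) == 1

# Complement rule: P(A) = 1 - P(not A).
assert prob(odd) == 1 - prob(sample_space - odd)

# Mutually exclusive events: probabilities add.
assert prob(odd | even) == prob(odd) + prob(even)

# General union: P(A or B) = P(A) + P(B) - P(A and B).
assert prob(odd | at_most_two) == prob(odd) + prob(at_most_two) - prob(odd & at_most_two)

print(prob(odd | at_most_two))
```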

Random Variables

  • Think and model probabilities for numeric outcomes of experiments
    • Densities (e.g. the bell curve) and mass functions for random variables
  • Remember
    • Statistical Inference: describes populations using data
    • Probability: a way to mathematically characterize a population
      • As a probability density function
    • We assume that our sample is a random draw from the population

Random Variable

  • A random variable is a numerical outcome of an experiment
    • Discrete
      • Takes one of a countable number of possible values
      • A mass function assigns the probability that it takes each specific value
      • e.g. tossing a coin, rolling a die, number of visits to a website each day (modeled with a Poisson distribution)

Random Variable

  • A random variable is a numerical outcome of an experiment
    • Continuous
      • Can take any value on the real line (or on a subset of it)
      • We assign a probability to its value being within some range
      • Densities characterize these probabilities
      • e.g. lengths, weights

Random Variable

  • Need mathematical functions to model the probabilities of random variables
    • Mass and density functions
      • Take possible values of the random variables
      • Assign the associated probabilities
      • Describe the population of interest
      • e.g. the normal distribution

Probability Mass Functions (PMF)

  • A PMF evaluated at a value gives the probability that a random variable takes that value
  • A valid PMF must satisfy
    • It must always be larger than or equal to 0
    • The sum of its values over all possible values of the random variable must be one
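A minimal sketch of these two conditions, assuming a hypothetical example: the number of heads in three flips of a fair coin. The helper name `is_valid_pmf` is illustrative. Exact fractions are used so the sum can be compared with one directly; with floating-point probabilities a small tolerance would be needed instead.

```python
from fractions import Fraction
from math import comb


def is_valid_pmf(pmf):
    """Check that all probabilities are non-negative and that they sum to one."""
    return all(p >= 0 for p in pmf.values()) and sum(pmf.values()) == 1


# PMF of the number of heads in three flips of a fair coin.
n_flips = 3
heads_pmf = {k: Fraction(comb(n_flips, k), 2 ** n_flips) for k in range(n_flips + 1)}

for k, p in heads_pmf.items():
    print(f"P(X = {k}) = {p}")

print("valid PMF:", is_valid_pmf(heads_pmf))
```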

Probability Density Functions (PDF)

  • A PDF is a function associated with a continuous random variable
  • Central dogma of probability density functions:
    • Areas under PDFs correspond to probabilities for that random variable
  • Valid PDFs must satisfy:
    1. It must be larger than or equal to zero everywhere
    2. The total area under it must be one
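To illustrate that areas under a PDF are probabilities, the sketch below uses a hypothetical density, f(x) = 2x on the interval [0, 1] and zero elsewhere, and approximates areas with a simple midpoint rule. The density and the helper `area_under` are assumptions made for this example.

```python
def density(x):
    """Hypothetical PDF: f(x) = 2x on [0, 1], zero elsewhere."""
    return 2 * x if 0 <= x <= 1 else 0.0


def area_under(f, a, b, steps=10_000):
    """Approximate the area under f between a and b with the midpoint rule."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width


total_area = area_under(density, 0, 1)       # should be close to 1
prob_below = area_under(density, 0, 0.75)    # P(X <= 0.75), close to 0.75 ** 2

print(f"total area  = {total_area:.4f}")
print(f"P(X <= 0.75) = {prob_below:.4f}")
```

For this density the exact answers are 1 and 0.5625, since the area under 2x from 0 to x is x squared.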

Cumulative Distribution Function (CDF)

Quantiles

  • The \(\alpha^{th}\) quantile of a distribution with distribution function F is the point \(x_{\alpha}\) so that \(F(x_{\alpha}) = \alpha\)
    • The 0.95 quantile of a distribution is the point below which 95% of the probability mass lies
  • A percentile is a quantile with \(\alpha\) expressed as a percent
  • The population median is the \(50^{th}\) percentile
  • Percentiles are not probabilities
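As a small illustration, the quantiles of a standard normal distribution can be obtained from the inverse of its distribution function. The sketch uses `statistics.NormalDist` from the Python standard library; the choice of the standard normal is an assumption made for the example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# The 0.95 quantile x_alpha satisfies F(x_alpha) = 0.95.
x_95 = standard_normal.inv_cdf(0.95)

# The median is the 0.5 quantile (50th percentile).
median = standard_normal.inv_cdf(0.5)

print(f"0.95 quantile: {x_95:.4f}")
print(f"F(x_0.95):     {standard_normal.cdf(x_95):.4f}")
print(f"median:        {median:.4f}")
```

Note that the quantile `x_95` is a point on the scale of the random variable, while 0.95 is the probability associated with it.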

Conditional Probability

  • Probabilities change as we obtain new information about a random variable
    • The probability of getting a one with a fair die is one sixth
    • If we know that the obtained number was odd (1, 3, or 5), we condition on this new information and the probability of obtaining a one is one third
    • We condition on new information
  • Definition \(P(A|B) = \frac{P(A \cap B)}{P(B)}\)
  • If \(A\) and \(B\) are independent \(P(A|B) = \frac{P(A)P(B)}{P(B)}= P(A)\)
    • If \(B\) happens, that does not give us any information about \(A\) happening
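The die example can be checked directly from the definition. This is a minimal sketch assuming one roll of a fair six-sided die; the helpers `prob` and `conditional_prob` and the event names are hypothetical.

```python
from fractions import Fraction

sample_space = frozenset(range(1, 7))  # one roll of a fair die


def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))


def conditional_prob(a, b):
    """P(A | B) = P(A and B) / P(B), defined only when P(B) > 0."""
    return prob(a & b) / prob(b)


one = {1}
odd = {1, 3, 5}
at_most_two = {1, 2}

print(prob(one))                   # 1/6 before any extra information
print(conditional_prob(one, odd))  # 1/3 once we know the roll was odd

# Independence: knowing the roll was odd does not change P(roll <= 2).
print(conditional_prob(at_most_two, odd) == prob(at_most_two))
```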

Bayes' Rule

  • Allows us to reverse the conditioning, provided that we know some marginal probabilities

\(P(B|A) = \frac{P(A|B)P(B)}{P(A|B)P(B) + P(A|B^c)P(B^c)} = \frac{P(A|B)P(B)}{P(A)}\)
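As a hedged numerical check of the formula, the sketch below takes one roll of a fair die with A = "the roll is odd" and B = "the roll is at most three" (both events chosen only for illustration) and compares Bayes' rule with the direct definition of conditional probability.

```python
from fractions import Fraction

sample_space = frozenset(range(1, 7))  # one roll of a fair die


def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))


def conditional_prob(a, b):
    """P(A | B) = P(A and B) / P(B)."""
    return prob(a & b) / prob(b)


A = {1, 3, 5}            # the roll is odd
B = {1, 2, 3}            # the roll is at most three
B_c = sample_space - B   # complement of B

# Bayes' rule, with P(A) expanded in the denominator.
numerator = conditional_prob(A, B) * prob(B)
denominator = numerator + conditional_prob(A, B_c) * prob(B_c)
bayes = numerator / denominator

print(bayes)                      # P(B | A) via Bayes' rule
print(conditional_prob(B, A))     # P(B | A) via the definition
print(denominator == prob(A))     # the denominator equals P(A)
```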

Diagnostic Tests

  • Definitions
    • \(+\) is the event that a result of a diagnostic test is positive
    • \(-\) is the event that a result of a diagnostic test is negative
    • \(D\) is the event that the subject of the test has the disease
    • \(D^c\) is the event that the subject of the test does not have the disease

Diagnostic Tests

  • Definitions
    • Sensitivity is the probability that the test is positive given that the subject is actually sick, \(P(+|D)\)
    • Specificity is the probability that the test is negative given that the subject is not actually sick, \(P(-|D^c)\)
    • Positive predictive value is the probability that the subject is sick given that the test is positive, \(P(D|+)\)
    • Negative predictive value is the probability that the subject is not sick given that the test is negative, \(P(D^c|-)\)
    • Prevalence of the disease is the marginal probability of disease, \(P(D)\)
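These quantities are connected by Bayes' rule. The sketch below computes the positive and negative predictive values from sensitivity, specificity, and prevalence. The numeric values (sensitivity 0.99, specificity 0.95, prevalence 0.01) are hypothetical and do not describe any real test, and the function names are illustrative.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(D | +) via Bayes' rule."""
    true_positive = sensitivity * prevalence               # P(+ | D) P(D)
    false_positive = (1 - specificity) * (1 - prevalence)  # P(+ | D^c) P(D^c)
    return true_positive / (true_positive + false_positive)


def negative_predictive_value(sensitivity, specificity, prevalence):
    """P(D^c | -) via Bayes' rule."""
    true_negative = specificity * (1 - prevalence)         # P(- | D^c) P(D^c)
    false_negative = (1 - sensitivity) * prevalence        # P(- | D) P(D)
    return true_negative / (true_negative + false_negative)


# Hypothetical test characteristics and disease prevalence.
ppv = positive_predictive_value(0.99, 0.95, 0.01)
npv = negative_predictive_value(0.99, 0.95, 0.01)

print(f"P(D | +)   = {ppv:.4f}")
print(f"P(D^c | -) = {npv:.4f}")
```

With these assumed numbers the positive predictive value is only about 1 in 6, even though the test is quite sensitive and specific, because the disease is rare.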

Normal Distribution

  • Density
  • Cumulative
  • Frequency
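A minimal sketch of the three views for a standard normal distribution, using `statistics.NormalDist` from the Python standard library. The sample size and seed are arbitrary choices for the illustration.

```python
from statistics import NormalDist

z = NormalDist(mu=0, sigma=1)

# Density: height of the bell curve at the mean.
density_at_mean = z.pdf(0)

# Cumulative: probability of falling within one standard deviation of the mean.
prob_within_one_sd = z.cdf(1) - z.cdf(-1)

# Frequency: share of simulated draws that fall within one standard deviation.
draws = z.samples(100_000, seed=1)
freq_within_one_sd = sum(-1 <= x <= 1 for x in draws) / len(draws)

print(f"density at 0:         {density_at_mean:.4f}")
print(f"P(-1 <= X <= 1):      {prob_within_one_sd:.4f}")
print(f"simulated frequency:  {freq_within_one_sd:.4f}")
```

The simulated frequency approaches the probability given by the cumulative distribution as the number of draws grows.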

Machine Learning (Separate Module)

  • Fitting a Model
  • Overfitting
  • Hypothesis Testing
  • Statistical Power

References

  • Material for this class was taken from different sources, mainly:
    • Statistical Inference for Data Science, Brian Caffo, Leanpub, 2015.
    • Doing Data Science, Rachel Schutt and Cathy O'Neil, 2013.