Statistics & Data Analysis - 101

Mohar Guha
HealthTap

Making Data Meaningful

Present and describe information from data
- Exploratory Analysis, Statistical Summary
Estimate characteristics of a large population from samples
- Point and Interval Estimates of Population Parameters
Make decisions based on samples
- Statistical Inference
Detect changes in a process
- Statistical Inference, Time Series Analysis
Obtain forecasts
- Regression, Time Series Analysis

Descriptive Statistics: Feel for the data
- Measures of central tendency (mean, median, mode)
- Measures of dispersion (variance, standard deviation)
- How many people bought a product? Customer profiles?
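As a minimal sketch of these summaries, the snippet below computes the measures of central tendency and dispersion with Python's standard `statistics` module. The `purchases` list is made-up data, used only for illustration.

```python
import statistics

# Hypothetical data: number of items bought by each of nine customers.
purchases = [2, 3, 3, 5, 7, 3, 8, 5, 9]

# Measures of central tendency
mean = statistics.mean(purchases)      # arithmetic average
median = statistics.median(purchases)  # middle value of the sorted data
mode = statistics.mode(purchases)      # most frequent value

# Measures of dispersion
variance = statistics.variance(purchases)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(purchases)      # square root of the sample variance

print("mean:", mean, "median:", median, "mode:", mode)
print("variance:", variance, "standard deviation:", std_dev)
```

Note that `statistics.variance` and `statistics.stdev` treat the data as a sample; `statistics.pvariance` and `statistics.pstdev` are the population versions.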

Help decision makers

Example: Find the probability that the first car I see in the morning is a Tesla.

Scenario I: Suppose I know the exact proportion of each car make in California. Then the probability can be computed exactly.
- Probabilistic Reasoning - Know the population and predict the sample
Scenario II: I do not have this information - Statistical Reasoning
Collect a random sample of $n$ cars in the street
Measure "how often" you see a Tesla, where $f$ is the number of Teslas among the $n$ cars: \[\text{Relative Frequency}=\frac{f}{n}\]
As $n$ increases, \[\begin{aligned} \text{Sample}&\rightarrow\text{Population}\\ \text{Relative Frequency}&\rightarrow\text{Probability} \end{aligned}\]
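The convergence of relative frequency to probability can be sketched with a small simulation. The true Tesla share `TRUE_PROPORTION` below is an invented value, chosen only so that the simulation has something to converge to; in Scenario II this number is exactly what we do not know.

```python
import random

# Assumed for illustration only: the true share of Teslas among all cars.
TRUE_PROPORTION = 0.05


def relative_frequency(n, p=TRUE_PROPORTION, seed=0):
    """Observe n random cars and return f / n, the observed share of Teslas."""
    rng = random.Random(seed)
    # f counts how many of the n observed cars turn out to be a Tesla.
    f = sum(1 for _ in range(n) if rng.random() < p)
    return f / n


# Larger samples should give relative frequencies closer to TRUE_PROPORTION.
for n in (10, 100, 1_000, 100_000):
    print(n, relative_frequency(n))
```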

$X$: number of heads in 10 tosses of an unbiased coin - Binomial random variable ($n=10$, $p=0.5$)
$X$: number of phone calls arriving at your help desk in a 12-hour period - Poisson random variable
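As a rough illustration of these two random variables, the sketch below evaluates their probability mass functions directly from the textbook formulas. The coin example uses the stated 10 fair tosses; the Poisson rate of 6 calls per 12-hour period is an assumed value, since the slides do not give one.

```python
import math


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam), where lam is the expected count."""
    return math.exp(-lam) * lam**k / math.factorial(k)


# Probability of exactly 5 heads in 10 tosses of an unbiased coin.
p_five_heads = binomial_pmf(5, 10, 0.5)

# Hypothetical rate: on average 6 calls arrive in a 12-hour period.
ASSUMED_CALL_RATE = 6.0
p_no_calls = poisson_pmf(0, ASSUMED_CALL_RATE)

print("P(5 heads in 10 tosses):", p_five_heads)
print("P(no calls in 12 hours):", p_no_calls)
```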

Many years ago I called the Laplace–Gaussian curve the normal curve, which name, while it avoids an international question of priority, has the disadvantage of leading people to believe that all other distributions of frequency are in one sense or another 'abnormal'. - Pearson
The position of a particle undergoing diffusion follows a normal distribution exactly
The logarithm of the size of living tissue is assumed to follow a normal distribution
Binomial random variables with a large number of trials, and Poisson random variables with a large mean, approximately follow a normal distribution
The normal distribution is immensely popular because of the Central Limit Theorem
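To make the normal approximation to the binomial concrete, here is a small sketch comparing an exact binomial probability with its normal counterpart. The values `n = 100`, `p = 0.5`, and the cutoff of 55 are arbitrary choices for illustration, and the half-unit continuity correction is a standard refinement that the slides do not mention.

```python
import math
from statistics import NormalDist

# Illustrative parameters (not from the slides): 100 tosses of a fair coin.
n, p = 100, 0.5
mu = n * p                           # binomial mean
sigma = math.sqrt(n * p * (1 - p))   # binomial standard deviation

# Exact P(X <= 55), summing the binomial probability mass function.
exact = sum(math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(56))

# Normal approximation with a continuity correction of 0.5.
approx = NormalDist(mu, sigma).cdf(55.5)

print("exact binomial:", exact)
print("normal approximation:", approx)
```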

The distribution of the sum of a large number of random variables is approximately normal
The random variables should be independent and come from the same distribution
This result holds no matter what the underlying distribution is, as long as it has a finite variance
Application demonstrating the Central Limit Theorem (changes to be made to the app): http://guhapp.shinyapps.io/myapp/
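The theorem can also be checked with a few lines of simulation. The sketch below sums independent draws from an exponential distribution, which is strongly skewed, and measures how much of the standardized sums fall within one standard deviation of zero. The choice of the exponential distribution, 50 terms per sum, and 5000 trials are all assumptions made for this example.

```python
import math
import random


def standardized_sums(n_terms, n_trials, seed=0):
    """Sum n_terms Exponential(1) draws, n_trials times, and standardize each sum."""
    rng = random.Random(seed)
    mu, sigma = 1.0, 1.0  # mean and standard deviation of Exponential(1)
    results = []
    for _ in range(n_trials):
        total = sum(rng.expovariate(1.0) for _ in range(n_terms))
        # Subtract the mean of the sum and divide by its standard deviation.
        results.append((total - n_terms * mu) / (sigma * math.sqrt(n_terms)))
    return results


z = standardized_sums(n_terms=50, n_trials=5000)
within_one_sd = sum(abs(v) <= 1 for v in z) / len(z)

# A standard normal variable lies within one SD of zero with probability about 0.683.
print("share of standardized sums within one SD:", within_one_sd)
```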

Take samples of size $n$ from a population with mean $\mu$ and standard deviation $\sigma$
The sample means $\bar{X}$, one per sample, form the sampling distribution of the mean
$E[\bar{X}]=\mu$ and $\text{SD}[\bar{X}]=\frac{\sigma}{\sqrt{n}}$
For large enough $n$, the distribution of $\bar{X}$ is approximately normal
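The two formulas for $E[\bar{X}]$ and $\text{SD}[\bar{X}]$ can be checked empirically. The sketch below assumes a Uniform(0, 1) population, for which $\mu = 0.5$ and $\sigma = 1/\sqrt{12}$; the population, the sample size of 25, and the 4000 repetitions are illustrative choices rather than anything taken from the slides.

```python
import math
import random
import statistics

# Assumed population: Uniform(0, 1), so mu = 0.5 and sigma = 1 / sqrt(12).
MU = 0.5
SIGMA = 1 / math.sqrt(12)


def sample_means(n, n_samples, seed=0):
    """Draw n_samples samples of size n and return the mean of each sample."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.random() for _ in range(n)) for _ in range(n_samples)]


n = 25
means = sample_means(n, n_samples=4000)

observed_mean = statistics.fmean(means)  # should be close to mu
observed_sd = statistics.stdev(means)    # should be close to sigma / sqrt(n)
expected_sd = SIGMA / math.sqrt(n)

print("mean of sample means:", observed_mean, "vs mu =", MU)
print("SD of sample means:", observed_sd, "vs sigma/sqrt(n) =", expected_sd)
```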

If the population distribution is normal, any sample size works
If the sample data are symmetric, unimodal, and without outliers, a sample size of 15 or less is enough
If the sample data are moderately skewed, unimodal, and without outliers, a sample size between 16 and 40 is enough
Otherwise, the sample size should be greater than 40, with no outliers