Making Data Meaningful
Present and describe information from data
Draw conclusions about large population based on samples
Make decisions / Conclusions
Detect changes in a process
Obtain forecasts
Mohar Guha
HealthTap
Present and describe information from data
Estimating large population characteristics based on samples
Make Decisions based on Samples
Detect changes in a process
Obtain forecasts
Descriptive Statistics: Feel for the data
help decision makers
Example: Find the probability that the first car I see in the morning is a Tesla.
Scenario I: Suppose I know the exact proportion of each car make in California; then I can compute the probability exactly.
Scenario II: I do not have this information, so I use statistical reasoning.
Collect a random sample of \(n\) cars on the street
Measure "how often" you see a Tesla, where \(f\) is the number of Teslas among the \(n\) cars: \[\text{Relative Frequency}=\frac{f}{n}\]
As \(n\) increases, \[\begin{eqnarray*} \text{Sample}&\rightarrow&\text{Population}\\ \text{Relative Frequency}&\rightarrow&\text{Probability} \end{eqnarray*}\]
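The convergence of relative frequency to probability can be illustrated with a small simulation. This is only a sketch: the true Tesla proportion `p_true = 0.05` is a made-up value chosen for illustration, not a figure from the talk.

```r
# Hypothetical assumption: 5% of cars on the street are Teslas.
set.seed(1)
p_true   <- 0.05
n_values <- c(10, 100, 1000, 10000, 100000)

# For each sample size n, draw the number of Teslas f and compute f / n.
rel_freq <- sapply(n_values, function(n) {
  f <- rbinom(1, size = n, prob = p_true)
  f / n
})

# Relative frequency should settle near p_true as n grows.
data.frame(n = n_values, relative_frequency = rel_freq)
```

With small \(n\) the relative frequency can be far from the assumed probability; with the largest \(n\) it should sit very close to it.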
\(X\): number of heads in 10 tosses of an unbiased coin - Binomial random variable
\(X\): number of phone calls arriving at your help desk in a 12-hour period - Poisson random variable
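Both random variables have ready-made probability functions in base R. The sketch below assumes, purely for illustration, an average of 24 calls per 12-hour period; that rate is not given in the talk.

```r
# Binomial: probability of exactly 5 heads in 10 tosses of a fair coin.
p_heads_5 <- dbinom(5, size = 10, prob = 0.5)

# Poisson: hypothetical average of 24 calls per 12-hour period.
lambda       <- 24
p_calls_20   <- dpois(20, lambda = lambda)   # exactly 20 calls
p_at_most_20 <- ppois(20, lambda = lambda)   # 20 calls or fewer

c(p_heads_5 = p_heads_5, p_calls_20 = p_calls_20, p_at_most_20 = p_at_most_20)
```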
Many years ago I called the Laplace–Gaussian curve the normal curve, which name, while it avoids an international question of priority, has the disadvantage of leading people to believe that all other distributions of frequency are in one sense or another 'abnormal'. - Pearson
The position of a particle undergoing diffusion exactly follows a normal distribution
The logarithm of the size of living tissue is assumed to follow a normal distribution
For large sample sizes, binomial and Poisson random variables approximately follow a normal distribution
The normal distribution is immensely popular because of the Central Limit Theorem
The distribution of the sum of a large number of random variables is approximately normal
The random variables should be independent and come from the same distribution
This result holds no matter what the underlying distribution is, provided it has a finite variance.
Application demonstrating the Central Limit Theorem: http://guhapp.shinyapps.io/myapp/
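For readers without access to the app, the same idea can be sketched in a few lines of R. The choice of an Exponential(1) population and of 50 terms per sum are assumptions made here for illustration; they are not taken from the app.

```r
set.seed(42)
n    <- 50      # number of variables in each sum (assumed)
reps <- 10000   # number of simulated sums

# Each sum adds n independent Exponential(rate = 1) draws,
# a strongly skewed distribution with mean 1 and variance 1.
sums <- replicate(reps, sum(rexp(n, rate = 1)))

# Standardize: the sum has mean n and standard deviation sqrt(n).
z <- (sums - n) / sqrt(n)

# If z is close to standard normal, about 95% of values lie within 1.96.
within_196 <- mean(abs(z) <= 1.96)
c(mean = mean(z), sd = sd(z), within_196 = within_196)
```

A histogram via `hist(z)` should look roughly bell-shaped even though the individual draws are skewed.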
Take samples of size \(n\) from a population with mean \(\mu\) and standard deviation \(\sigma\)
The means \(\bar{X}\) of these samples form the sampling distribution of the mean
\(E[\bar{X}]=\mu\) and \(\text{SD}[\bar{X}]=\frac{\sigma}{\sqrt{n}}\)
For large enough \(n\), the distribution of \(\bar{X}\) is approximately normal
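The two moment formulas for \(\bar{X}\) can be checked by simulation. The population values below (\(\mu=50\), \(\sigma=3\), \(n=36\)) are hypothetical and chosen only to make the arithmetic easy to read.

```r
set.seed(7)
mu    <- 50     # assumed population mean
sigma <- 3      # assumed population standard deviation
n     <- 36     # sample size
reps  <- 20000  # number of repeated samples

# Draw many samples and record each sample mean.
xbar <- replicate(reps, mean(rnorm(n, mean = mu, sd = sigma)))

# Compare the simulated moments with mu and sigma / sqrt(n).
c(mean_of_xbar = mean(xbar),
  sd_of_xbar   = sd(xbar),
  theory_sd    = sigma / sqrt(n))
```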
If the population distribution is normal, any sample size (\(n>1\)) works
If the sample data are symmetric, unimodal, and without outliers, a sample size of 15 or less is enough
If the sample data are moderately skewed, unimodal, and without outliers, a sample size between 16 and 40 is enough
Otherwise, use a sample size greater than 40, without outliers
Is the sample mean (\(\bar{X}\)) the best estimate of \(\mu\)?
Interval Estimate: Or should we use the sample mean (\(\bar{X}\)) and provide an interval, centered around \(\bar{X}\), of likely values of the population mean \(\mu\)?
Ghirardelli Chocolate Company claims that a 20oz gift bag contains 50 squares. To test the claim we administer a study by sampling 60 bags out of 200 sent by the company.
Estimate the average number of squares \(\mu\) in the "population" of all gift bags.
The sample mean is \(\bar{x}=49.22\)
The sample variance is \(s^2=9.58\)
Confidence Interval: \((49.22-\text{error},\,49.22+\text{error})\)
How good is the sample mean \(\bar{x}\) as an estimate of \(\mu\)?
If the sample size \(n\) is large, then the estimate is good: \[\text{error}\propto\frac{1}{\sqrt{n}}\]
If the variance in the sample is high, then the estimate is not good: \[\text{error}\propto\frac{s}{\sqrt{n}}\]
Precisely, \[\text{error}=z^{*}\frac{s}{\sqrt{n}},\] where \(z^*\) is a critical value from the standard normal distribution.
By the Central Limit Theorem, \(\bar{X}\) is approximately \(N\left(\mu,\frac{\sigma}{\sqrt{n}}\right)\), and \(s\) estimates \(\sigma\).
\(P\left(\left|\frac{\bar{X}-\mu}{s/\sqrt{n}}\right|>z^{*}\right)=0.05 \implies z^{*}=z^{*}_{0.025}=1.96 \implies P\left(\bar{X}-1.96\frac{s}{\sqrt{n}}\leq\mu\leq\bar{X}+1.96\frac{s}{\sqrt{n}}\right)=0.95\)
The 95% confidence interval for the mean is \[\bar{x}\pm z^{*}\frac{s}{\sqrt{n}}= \left(49.22 - 1.96\frac{\sqrt{9.58}}{\sqrt{60}},49.22 + 1.96\frac{\sqrt{9.58}}{\sqrt{60}}\right)=(48.44,50.00)\]
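The interval can be reproduced from the three summary numbers used in the formula (\(\bar{x}=49.22\), \(s^2=9.58\), \(n=60\)). A minimal sketch, with variable names chosen here for illustration:

```r
xbar <- 49.22   # sample mean used in the interval
s2   <- 9.58    # sample variance
n    <- 60      # number of bags sampled

z_star <- qnorm(0.975)              # critical value for 95% confidence
margin <- z_star * sqrt(s2 / n)     # z* times s / sqrt(n)

ci <- c(lower = xbar - margin, upper = xbar + margin)
round(ci, 2)
```

The claimed value of 50 squares sits right at the upper edge of the interval.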
If we repeat the experiment 100 times and build a 95% confidence interval each time:
About 95 of the 100 confidence intervals are expected to contain the population parameter.
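This repeated-sampling interpretation can be checked by simulation. The sketch below assumes a normal population with \(\mu=50\) and \(\sigma=3\); these values are hypothetical and are used only because a simulation needs a known truth to compare against.

```r
set.seed(123)
mu    <- 50     # assumed true population mean
sigma <- 3      # assumed true population standard deviation
n     <- 60     # sample size per experiment
reps  <- 10000  # number of repeated experiments

# For each experiment, build a 95% interval and record whether it covers mu.
covered <- replicate(reps, {
  x      <- rnorm(n, mean = mu, sd = sigma)
  margin <- qnorm(0.975) * sd(x) / sqrt(n)
  abs(mean(x) - mu) <= margin
})

coverage <- mean(covered)   # fraction of intervals containing mu
coverage
```

The fraction should come out close to 0.95, though not exactly, because each run is random and \(s\) is used in place of \(\sigma\).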
The calculation of the confidence interval used the normal distribution (justified by the CLT)
It assumes that the sample size is large and that the population variance is known
How accurately do you need the answer?
"We need a margin of error less than 4%"
What level of confidence do you intend to use?
"95% confidence interval"
Any historical knowledge about the data?
"From a previous study, the variance of the number of chocolate squares is around 3"
Answer: the margin of error must satisfy \(z^*\frac{s}{\sqrt{n}}\leq 0.04\implies n\geq\left(\frac{1.96\sqrt{3}}{0.04}\right)^2\approx 7203\)
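The sample-size calculation is a one-line rearrangement of the margin-of-error formula. A sketch using the three quoted inputs (margin 0.04, 95% confidence, variance 3), with variable names chosen here for illustration:

```r
sigma2 <- 3      # variance from the previous study
margin <- 0.04   # required margin of error
conf   <- 0.95   # confidence level

z_star <- qnorm(1 - (1 - conf) / 2)

# Solve z* * sqrt(sigma2 / n) <= margin for n and round up.
n_required <- ceiling(z_star^2 * sigma2 / margin^2)
n_required
```

Halving the margin of error would roughly quadruple the required sample size, since \(n\) scales with \(1/\text{margin}^2\).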
Find out if the data confirm a specific hypothesis.
Null Hypothesis : \(H_{0}\) - status quo - initially assumed true
Alternative Hypothesis : \(H_{A}\) - the researcher's proposal - what you hope to show.
Main idea : Reject the null hypothesis in favor of the alternative only with significant evidence.
Consider mistakes in jury trial.
Null Hypothesis \(H_0\): The defendant is not guilty.
Innocent man is pronounced guilty: Reject \(H_0\) when it is true : TYPE I ERROR
Probability of Type I Error = \(\alpha\) = SIGNIFICANCE LEVEL
Guilty man is pronounced innocent: Fail to reject \(H_0\) when it is false: TYPE II ERROR
Probability of Type II Error = \(\beta\); POWER = \(1-\beta\) = probability of rejecting \(H_0\) when it is false
Define parameter
Give null and alternative hypothesis
Select significance level \(\alpha\) (typical values 0.05, 0.01, 0.10)
Give the test statistic formula - \(\frac{\text{Observed}-\text{Expected}}{\text{Standard Error}}\)
Verify the conditions of the test
Compute the p-value - the probability of getting a value at least as extreme as the test statistic, assuming \(H_0\) is true.
State the conclusion - if p-value \(\leq \alpha\), there is significant evidence to reject \(H_0\); otherwise, fail to reject \(H_0\).
In a study of math students at a high school, the backgrounds of students successful in entry-level courses were checked. For 30 students from city backgrounds, the average test score was 78 with a standard deviation of 10; and for 25 students from rural backgrounds, the average test score was 85 with a standard deviation of 15. Is there evidence that the average test score differs between the two groups of students?
Let \(\mu_1\) be the mean test score of all city students who succeed in the course, and \(\mu_2\) be the mean test score of all rural students who succeed.
\(H_{0}:\mu_{1}-\mu_{2}=0\), \(H_{A}:\mu_{1}- \mu_{2}\neq 0\)
For this analysis set \(\alpha=0.10\)
\(\text{SE}=\sqrt{\frac{s_{1}^2}{n_1}+\frac{s_{2}^2}{n_2}}=3.51\) and \(t=\frac{\bar{x}_{1}-\bar{x}_{2}-0}{\text{SE}}=-1.99\) with \(\text{df}=40.47\).
P-value \(= P(t<-1.99)+P(t>1.99)=0.054\)
Interpret results: Since P-value is less than the significance level \(\alpha=0.10\), we have statistically significant evidence to reject the null hypothesis.
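The test statistic, degrees of freedom, and P-value can be recomputed from the summary statistics alone. The sketch below assumes the unequal-variance (Welch) form of the two-sample t-test, which is what the fractional degrees of freedom suggest; variable names are chosen here for illustration.

```r
# Summary statistics from the study
n1 <- 30; xbar1 <- 78; s1 <- 10   # city students
n2 <- 25; xbar2 <- 85; s2 <- 15   # rural students

v1 <- s1^2 / n1
v2 <- s2^2 / n2

se     <- sqrt(v1 + v2)                 # standard error of the difference
t_stat <- (xbar1 - xbar2 - 0) / se      # test statistic under H0

# Welch-Satterthwaite approximation to the degrees of freedom
df <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# Two-sided P-value
p_value <- 2 * pt(-abs(t_stat), df)

round(c(se = se, t = t_stat, df = df, p_value = p_value), 3)
```

The P-value lands between 0.05 and 0.10, so the conclusion depends on the chosen \(\alpha\): reject at \(\alpha=0.10\), fail to reject at \(\alpha=0.05\).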
A very small P-value indicates that the observed effect is unlikely to have occurred purely by chance if the null hypothesis is true
If the P-value is moderately large, it is incorrect to conclude "there is evidence that the intervention has no effect"; the data simply do not provide enough evidence against the null hypothesis, and the alternative may still be plausible given the data.
Compute the \(100(1-\alpha)\%\) confidence interval for the difference in means
Check whether the hypothesized value is in the interval
The \(100(1-\alpha)\%\) confidence interval gives the range of hypothesized values that would not be rejected by a two-sided level-\(\alpha\) test.
If P-value of the test is less than \(\alpha\) (it is significant), the confidence interval will NOT contain the hypothesized mean.
If P-value of the test is greater than \(\alpha\) (it is not significant), the confidence interval will contain the hypothesized mean.
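The link between the test and the interval can be seen on the city versus rural example. This sketch rebuilds the 90% confidence interval for \(\mu_1-\mu_2\) from the summary statistics, again assuming the unequal-variance (Welch) approach.

```r
n1 <- 30; xbar1 <- 78; s1 <- 10   # city students
n2 <- 25; xbar2 <- 85; s2 <- 15   # rural students
alpha <- 0.10

v1 <- s1^2 / n1
v2 <- s2^2 / n2
se <- sqrt(v1 + v2)
df <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# 100(1 - alpha)% confidence interval for the difference in means
t_star <- qt(1 - alpha / 2, df)
ci     <- (xbar1 - xbar2) + c(-1, 1) * t_star * se

# The hypothesized difference under H0 is 0.
contains_zero <- ci[1] <= 0 && 0 <= ci[2]

round(ci, 2)
contains_zero
```

The interval excludes 0, which agrees with rejecting \(H_0\) at \(\alpha=0.10\).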
Each of the two populations being compared should follow a normal distribution.
Tests to check normality
Data for classical t-tests should be sampled independently from the two populations being compared.
Try to investigate the reasons for non-normality: outliers, sampling error
Depending on the shape of the sample distribution, a log, square root, or reciprocal transformation can reduce the skewness of the data.
For non-normal data, conduct non-parametric tests (Mann-Whitney)
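These checks are available in base R. The sketch below uses simulated right-skewed (log-normal) data, which is an assumption made here for illustration. The Shapiro-Wilk test is one common normality test, not necessarily the one intended in the talk, and `wilcox.test` is R's implementation of the Mann-Whitney test.

```r
set.seed(99)
# Hypothetical right-skewed samples for two groups
x <- rlnorm(40, meanlog = 0,   sdlog = 1)
y <- rlnorm(40, meanlog = 0.5, sdlog = 1)

# Normality check before and after a log transformation
shapiro_raw <- shapiro.test(x)$p.value
shapiro_log <- shapiro.test(log(x))$p.value

# Non-parametric comparison of the two groups (Mann-Whitney)
mw_p <- wilcox.test(x, y)$p.value

c(shapiro_raw = shapiro_raw, shapiro_log = shapiro_log, mann_whitney = mw_p)
```

A small Shapiro-Wilk P-value on the raw data, together with a larger one after the log transformation, would suggest that the transformation has reduced the skewness.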
Consider conversion rates \(c_A\) and \(c_B\) for two versions \(A\) and \(B\), green versus red signup button.
Simulate two experiments with population conversion rates \(c_{A}=0.50\) and \(c_B = 0.55\)
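set.seed(456)
c_A <- 0.5    # population conversion rate for version A
c_B <- 0.55   # population conversion rate for version B
n <- 500      # number of visitors shown each version
x.A <- sum(rbinom(n, 1, c_A))   # observed conversions for A
x.B <- sum(rbinom(n, 1, c_B))   # observed conversions for B
# two-sided test of equal proportions
pval1 <- prop.test(c(x.A, x.B), c(n, n))$p.value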
The two-sided proportion test gives a p-value of 0.658, so we cannot reject the null hypothesis at the 5% significance level.
We repeat the experiment with \(n=5000\) and get a significant p-value of \(3.6271\times 10^{-10}\).
What is the optimal \(n\)?
How many samples do we need to detect the difference?
Set a level for power, say an 80% chance of rejecting the null hypothesis when it is false.
Say \(p_1\) is the current conversion rate and \(p_2\) is the conversion rate you wish to detect.
test<- power.prop.test(p1=0.5, p2=0.55, sig.level=0.05, power=0.8)
The sample size needed to pick up a 10% relative increase in conversion rate (from 0.50 to 0.55) at a 5% significance level with 80% power is 1565 per group.
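The same function can also be run the other way round, to ask how much power an experiment of a given size has. A sketch, using the same rates and significance level as above:

```r
# Required sample size per group for 80% power
test <- power.prop.test(p1 = 0.5, p2 = 0.55, sig.level = 0.05, power = 0.8)
n_per_group <- ceiling(test$n)

# Power actually achieved with only 500 visitors per group
power_at_500 <- power.prop.test(n = 500, p1 = 0.5, p2 = 0.55,
                                sig.level = 0.05)$power

c(n_per_group = n_per_group, power_at_500 = round(power_at_500, 3))
```

With 500 visitors per version the test would detect this difference only about a third of the time, which is consistent with the non-significant result of the first simulated experiment.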
Using frequentist test we can make a statement like:
We cannot make a statement like:
We can take a Bayesian approach.
Explanation courtesy of Stack Exchange:
http://stats.stackexchange.com/questions/22/bayesian-and-frequentist-reasoning-in-plain-english
I have misplaced my phone somewhere in the home. I can use the phone locator on the base of the instrument to locate the phone and when I press the phone locator the phone starts beeping.
Problem: Which area of my home should I search?
Frequentist Reasoning:
I can hear the phone beeping. I also have a mental model that helps me identify the area the sound is coming from. Therefore, upon hearing the beep, I infer the area of my home I must search to locate the phone.
Bayesian Reasoning:
I can hear the phone beeping. Now, apart from a mental model that helps me identify the area the sound is coming from, I also know the locations where I have misplaced the phone in the past. So, I combine my inferences from the beeps with my prior information about those locations to identify an area I must search to locate the phone.
Let us say a man rolls a six sided die and it has outcomes 1, 2, 3, 4, 5, or 6. Furthermore, he says that if it lands on a 3, he'll give you a free text book.
Frequentist : Each outcome has an equal 1 in 6 chance of occurring. Probability is viewed as long-run relative frequency.
Bayesian : Hang on a second, I know that man, he's David Blaine, a famous trickster! I have a feeling he's up to something. I'm going to say that there's only a 1% chance of it landing on a 3 BUT I'll re-evaluate that belief and change it the more times he rolls the die. If I see the other numbers come up equally often, then I'll iteratively increase the chance from 1% to something slightly higher, otherwise I'll reduce it even further. Probability is viewed as degrees of belief in a proposition.
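The Bayesian updating described here can be sketched with a Beta prior on the probability of rolling a 3. Everything numeric below is an assumption made for illustration: the Beta(1, 99) prior is just one way to encode a "1% chance" belief, and the 60 rolls with 10 threes are made-up data.

```r
# Prior belief: Beta(1, 99), whose mean is 1 / (1 + 99) = 0.01 (assumed)
a_prior <- 1
b_prior <- 99

# Hypothetical observations: 60 rolls, 10 of them landing on 3
rolls  <- 60
threes <- 10

# Conjugate update: add successes to a and failures to b
a_post <- a_prior + threes
b_post <- b_prior + (rolls - threes)

prior_mean <- a_prior / (a_prior + b_prior)
post_mean  <- a_post / (a_post + b_post)
freq_est   <- threes / rolls   # frequentist estimate from the data alone

c(prior_mean = prior_mean, post_mean = post_mean, freq_est = freq_est)
```

The posterior mean moves from the sceptical prior toward the observed frequency, and would keep moving as more rolls are seen.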
P value is
(A) The critical value of a test
(B) The estimate of the population parameter
(C) Probability that the null hypothesis is true
(D) Percentages of experiments in which the sample differences would be larger or smaller than we observed.
A biologist has taken a random sample of a specific type of fish from a large lake. A 95 percent confidence interval was calculated to be [5.6,8] pounds. Which of the following is true?
(A) 95 percent of all the fish in the lake weigh between 5.6 and 8 pounds.
(B) In repeated sampling, 95 percent of the sample proportions will fall within 5.6 and 8 pounds.
(C) In repeated sampling, 95% of the time the true population mean of fish weights will be equal to 6.8 pounds.
(D) In repeated sampling, 95% of the time the true population mean of fish weight will be captured in the constructed interval.
(E) We are 95 percent confident that all the fish weigh less than 8 pounds in this lake.
A manufacturer claims that a particular automobile model will get 50 miles per gallon on the highway. The researchers at a consumer-oriented magazine believe that this claim is high and plan a test with a simple random sample of 30 cars. Assuming the standard deviation between individual cars is 2.3 miles per gallon, what should the researchers conclude if the sample mean is 49 miles per gallon and the P-value for the test is 0.0087?
(A) There is not sufficient evidence to reject the manufacturer’s claim; 49 miles per gallon is too close to the claimed 50 miles per gallon.
(B) The manufacturer’s claim should not be rejected because the P-value of .0087 is too small.
(C) The manufacturer’s claim should be rejected because the sample mean is less than the claimed mean.
(D) The P-value of .0087 is sufficient evidence to reject the manufacturer’s claim.
(E) The P -value of .0087 is sufficient evidence to prove that the manufacturer’s claim is false.
A 90% confidence interval for a population mean \(\mu\) is determined to be (800,900). If the confidence level is increased to 95% while the sample statistics and sample size remain same, the confidence interval for \(\mu\) becomes
(A) narrower
(B) 0.05
(C) wider
(D) 0.025
(E) doesn't change since the sample doesn't change.