MAS 261 - Lecture 16
Proportions, Percentages and Confidence Intervals
Housekeeping
Comments and Questions about HW 5
A few minutes for R Questions 🪄
Review of Confidence Interval Concepts and Definitions
Terminology for Proportion Estimates
Point Estimates and Confidence Interval for Proportions
Introduction to the concept of Hypotheses
Examining Hypotheses using Confidence Intervals
R and RStudio
In this course we will use R and RStudio to understand statistical concepts.
You will access R and RStudio through Posit Cloud.
- Sign up for a Free Posit Cloud Account
I will post R/RStudio files on Posit Cloud that you can access in provided links.
I will also provide demo videos that show how to access files and complete exercises.
NOTE: The free Posit Cloud account is limited to 25 hours per month.
I demo how to download completed work so that you can use this allotment efficiently.
For those who want to go further with R/RStudio:
- I have added a new page to the MAS 261 website, Installing R and RStudio
Lecture 15 In-class Exercises - Q1
In Lecture 14 and HW 5, we covered the three components of the margin of error, E, which is the half-width of the confidence interval.
Recall
CI Lower Bound = \(\overline{X}-E\)
CI Upper Bound = \(\overline{X}+E\)
\(E = \frac{S}{\sqrt{n}}\times t\) where
- S is the sample standard deviation
- n is the sample size
- t is determined by the confidence level and the degrees of freedom (df = n-1)
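As a minimal sketch of the formula above, E can be computed in R with qt() supplying t. The function name margin_of_error_mean and the inputs S = 10 and n = 25 are illustrative assumptions, not course data.

```r
# Margin of error for a mean: E = S / sqrt(n) * t
# (function name and example inputs are hypothetical, for illustration only)
margin_of_error_mean <- function(s, n, conf_level = 0.95) {
  alpha  <- 1 - conf_level
  t_crit <- qt(1 - alpha / 2, df = n - 1)  # t depends on confidence level and df = n - 1
  s / sqrt(n) * t_crit
}

# Hypothetical inputs: S = 10, n = 25
E_95 <- margin_of_error_mean(s = 10, n = 25, conf_level = 0.95)
E_99 <- margin_of_error_mean(s = 10, n = 25, conf_level = 0.99)
E_95
E_99  # a higher confidence level gives a larger E for the same S and n
```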
Which component of E do we have no ability to control?
- S, n or t?
Questions from HW 5 and Quiz 1
HW 5 is due Friday, 10/18 and the grace period is extended until 10/21 at midnight.
I do not have office hours on Friday (10/18).
Sinuo, the course TA, does have office hours on Friday.
If there are questions from HW 5 or Quiz 1 that are general and would benefit everyone, please let me know.
- Questions can be about concepts, R skills, etc.
Estimating a Proportion
Proportion: a part of a whole population
Expressed as a value between 0 and 1
- Proportion x 100% = Percentage
Notation:
Population proportion: P (Often unknown)
Sample proportion: \(\hat{P}\) (from sample data)
Confidence Interval for a Proportion
Lower Bound: \(\hat{P} - E\)
Upper Bound: \(\hat{P} + E\)
E is the margin of error (calculated a little differently than for quantitative data).
Are US Drivers Ready to Go Fully Electric?
Are US Drivers Still Skeptical of EVs?
These data were collected with a poll that had SIX response choices but we’ll use a common analytical technique to simplify the analyses.
We group the data into
NO’s (Responded NOT Likely)
Not NO’s (Did not respond NOT likely)
The pollsters interviewed 1025 Adults in the US (n = 1025)
584 of those interviewed responded ‘NOT Likely’
441 did not respond ‘NOT Likely’
Are we 95% confident that the majority of US adults are not ready for EVs?
Are we 99% confident that the majority of US adults are not ready for EVs?
Calculating an Estimate and a Confidence Interval
Estimated proportion, \(\hat{P} = \frac{X}{n}=\frac{584}{1025} = 0.57\)
- X is the number of observations in the category of interest and n is the sample size
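The arithmetic for the point estimate can be sketched in R using the poll counts given above. The variable names are assumptions chosen for illustration.

```r
# Counts from the EV poll (variable names are illustrative)
x      <- 584    # responded 'NOT Likely'
not_no <- 441    # did not respond 'NOT Likely'
n      <- 1025   # total adults interviewed

# The two groups should account for the whole sample
x + not_no == n

# Sample proportion and the corresponding percentage
p_hat <- x / n
p_hat
round(p_hat * 100, 1)  # Proportion x 100% = Percentage
```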
95% Confidence Interval
correct = F is used because the sample size is large enough that a continuity correction is not needed.
Students in MAS 261 are not required to learn about when to use a continuity correction.
All examples and questions will have sufficiently large sample sizes
1-sample proportions test without continuity correction
data: 584 out of 1025, null probability 0.5
X-squared = 19.95, df = 1, p-value = 0.000007948
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.5392410 0.5997503
sample estimates:
p
0.5697561
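The call that produced this output is not shown here. A call along these lines, using base R's prop.test with the poll counts, should reproduce it; the object name ev_95 is an assumption.

```r
# 584 'NOT Likely' responses out of 1025, tested against a null proportion of 0.5
# correct = FALSE turns off the continuity correction
# conf.level is left at its default of 0.95
ev_95 <- prop.test(x = 584, n = 1025, p = 0.5, correct = FALSE)
ev_95

# Individual pieces of the result can be pulled out by name
ev_95$conf.int   # lower and upper bounds of the confidence interval
ev_95$estimate   # sample proportion
```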
Lecture 14 In-class Exercises - Q2 and Q3
Interpreting the prop.test output:
- Question 2:
- The 95% lower bound for the true proportion of US adults who are currently ‘NOT Likely’ to buy an EV is ____.
- Question 3:
Fill in the blank: We are 95% confident that a majority of US adults are ____ to buy an EV at this time. Specify likely or not likely.
Would results change if we opt for a 99% Confidence Interval?
In Lecture 14 we discussed that if we want to be MORE confident that we have captured the true value, our interval will be wider.
- Increased Confidence Level translates to WIDER Confidence Interval
1-sample proportions test without continuity correction
data: 584 out of 1025, null probability 0.5
X-squared = 19.95, df = 1, p-value = 0.000007948
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.5392410 0.5997503
sample estimates:
p
0.5697561
1-sample proportions test without continuity correction
data: 584 out of 1025, null probability 0.5
X-squared = 19.95, df = 1, p-value = 0.000007948
alternative hypothesis: true p is not equal to 0.5
99 percent confidence interval:
0.529599 0.609016
sample estimates:
p
0.5697561
- In this case, the conclusion based on the confidence interval does not change because the sample size is large.
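A sketch of how the two intervals can be compared in R, assuming the same prop.test call with only conf.level changed. The object names are illustrative.

```r
# Same poll counts, two confidence levels
ev_95 <- prop.test(584, 1025, correct = FALSE, conf.level = 0.95)
ev_99 <- prop.test(584, 1025, correct = FALSE, conf.level = 0.99)

# Width of each interval = upper bound - lower bound
width_95 <- diff(ev_95$conf.int)
width_99 <- diff(ev_99$conf.int)
c(width_95 = width_95, width_99 = width_99)  # the 99% interval is wider

# Is the lower bound of the wider interval still above 0.5?
ev_99$conf.int[1] > 0.5
```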
How is this margin of Error, E, estimated?
In this case, t is NOT used to find E AND S is calculated differently
The t distribution is not appropriate for categorical data, but by the Central Limit Theorem (CLT) we can use the Z distribution:
- 80% CI: Z = 1.282
- 90% CI: Z = 1.645
- 95% CI: Z = 1.960
- 99% CI: Z = 2.576
These are categorical (binomial) data so \(S = \sqrt{\hat{P}\times(1-\hat{P})}\)
- Recall that \(\hat{P} = \frac{X}{n}\)
Margin of Error for Proportion Data:
- \(E = \frac{S}{\sqrt{n}}\times Z = \sqrt{\frac{\hat{P}\times(1-\hat{P})}{n}}\times Z\)
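The Z values and the formula above can be sketched in R as follows. This example works through the 95% case only; the variable names are assumptions.

```r
# Z values for the confidence levels listed above, from the standard normal distribution
conf_levels <- c(0.80, 0.90, 0.95, 0.99)
z_values <- qnorm(1 - (1 - conf_levels) / 2)
round(z_values, 3)

# Margin of error for the EV poll at 95% confidence
x <- 584
n <- 1025
p_hat <- x / n
z_95  <- qnorm(0.975)
E_95  <- sqrt(p_hat * (1 - p_hat) / n) * z_95
E_95

# Interval built directly from the formula
c(lower = p_hat - E_95, upper = p_hat + E_95)
```

The bounds from this formula are close to, but not exactly the same as, the prop.test bounds, because prop.test builds its interval by inverting the score test rather than adding and subtracting E around the sample proportion.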
Lecture 14 In-class Exercises - Q4 and Q5
Question 4: What is the standard deviation for our EV poll data?
Recall that \(\hat{P} = \frac{584}{1025} = 0.57\)
Question 5: If we are calculating a 99% Confidence Interval for a proportion, what Z value should we use?
See previous slide or check the bottom of this t-table.
Lecture 14 In-class Exercises - Q6
What is the margin of error, E, for the 99% Confidence Interval for the EV poll data?
\(E = \frac{S}{\sqrt{n}}\times Z = \sqrt{\frac{\hat{P}\times(1-\hat{P})}{n}}\times Z\)
Suggested strategy:
Divide the answer to Question 4 by \(\sqrt{n}=\sqrt{1025}\)
Multiply this ratio by Z
Check work by using prop.test output (\(E = \frac{UB-LB}{2}\))
1-sample proportions test without continuity correction
data: 584 out of 1025, null probability 0.5
X-squared = 19.95, df = 1, p-value = 0.000007948
alternative hypothesis: true p is not equal to 0.5
99 percent confidence interval:
0.529599 0.609016
sample estimates:
p
0.5697561
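One way to carry out the check in R is to pull the bounds out of the prop.test result and compare the half-width with the formula. This is a sketch; the object and variable names are assumptions.

```r
# 99% interval from prop.test
ev_99 <- prop.test(584, 1025, correct = FALSE, conf.level = 0.99)
lb <- ev_99$conf.int[1]
ub <- ev_99$conf.int[2]

# Half-width of the interval from the output
E_from_output <- (ub - lb) / 2
E_from_output

# Margin of error from the formula, for comparison
p_hat <- 584 / 1025
E_formula <- sqrt(p_hat * (1 - p_hat) / 1025) * qnorm(0.995)
E_formula
```

The two values should agree to about three decimal places. They are not identical because prop.test builds its interval by inverting the score test, so a small difference does not mean the hand calculation is wrong.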
Introduction to Hypotheses
So far this course has dealt with describing data and estimating values.
Often we want to go beyond description and estimation
Based on what we see in the data, we want to formulate and test hypotheses.
Hypotheses can have different formats based on type of data and questions being asked.
In the next lecture we’ll talk about the formal language of hypothesis testing.
Today we’ll discuss some concepts that show we have already been informally testing hypotheses.
When looking at data, it is natural to develop hypotheses based on what we notice.
Testing hypotheses is a formal way of examining the data, graphically and numerically.
When we test our hypotheses, we are asking “Do these data support the ideas I have developed about this population?”
One more Look at the EV Poll 95% CI
Hypothesis: US adults are evenly split (50-50) on whether or not to buy an EV.
Our data disprove this hypothesis IF the 95% confidence interval EXCLUDES 0.5.
1-sample proportions test without continuity correction
data: 584 out of 1025, null probability 0.5
X-squared = 19.95, df = 1, p-value = 0.000007948
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.5392410 0.5997503
sample estimates:
p
0.5697561
Our 95% interval (and our 99% interval) endpoints are both ABOVE 0.5, which disproves this hypothesis.
This conclusion matches the other parts of the output.
The p-value is the probability of seeing sample data at least as extreme as these if the specified hypothesis is true.
- This hypothesis is the Null Hypothesis that is the default for proportion tests.
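A sketch of how both checks could be done in R with the EV poll counts, assuming the prop.test call shown in the output. The object names are illustrative.

```r
# Test the 50-50 hypothesis (null proportion p = 0.5) for the EV poll
ev_95 <- prop.test(584, 1025, p = 0.5, correct = FALSE)

# Check 1: does the 95% confidence interval exclude 0.5?
ci <- ev_95$conf.int
excludes_half <- ci[1] > 0.5 || ci[2] < 0.5
excludes_half

# Check 2: is the p-value below 0.05?
ev_95$p.value
ev_95$p.value < 0.05
```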
A Previous Confidence Interval Example
Recall, the global mean number of subscribers is 21.89 million for the top 1000 YouTubers.
Is this random sample of 60 US YouTubers typical of the global top 1000?
Null Hypothesis: These 60 randomly sampled US YouTubers are not different from the larger global population of top YouTubers.
If data disprove this hypothesis:
95% confidence interval will NOT contain 21.89
p-value will be less than 0.05
One Sample t-test
data: yt60$subscribers_mil
t = -0.55738, df = 59, p-value = 0.5794
alternative hypothesis: true mean is not equal to 21.89
95 percent confidence interval:
17.65192 24.28141
sample estimates:
mean of x
20.96667
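The yt60 data are not included here, so the call itself (presumably something like t.test(yt60$subscribers_mil, mu = 21.89)) cannot be rerun in this sketch. As an alternative, the interval and p-value can be reconstructed from the numbers printed in the output; the variable names below are assumptions.

```r
# Values copied from the t.test output above
x_bar  <- 20.96667   # sample mean
mu_0   <- 21.89      # global mean stated in the null hypothesis
t_stat <- -0.55738   # t statistic
df     <- 59         # degrees of freedom (n - 1 = 60 - 1)

# t = (x_bar - mu_0) / SE, so the standard error S / sqrt(n) can be recovered
se <- (x_bar - mu_0) / t_stat

# Margin of error and 95% confidence interval
E  <- qt(0.975, df) * se
ci <- c(lower = x_bar - E, upper = x_bar + E)
ci

# Does the interval contain 21.89?
contains_null <- unname(ci["lower"] <= mu_0 && mu_0 <= ci["upper"])
contains_null

# Two-sided p-value from the t statistic
p_value <- 2 * pt(-abs(t_stat), df)
p_value
```

Because the inputs are rounded values from the printed output, the reconstructed bounds may differ from the printed ones in the last decimal place or two.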
Lecture 14 In-class Exercises - Q7
Null Hypothesis: These 60 randomly sampled US YouTubers are not different from the larger global population of top YouTubers which has a population mean of 21.89 million.
Does the t.test confidence interval output disprove this hypothesis?
Key Points from Today
Categorical Data can be simplified to two categories for analysis purposes
Two category data - Estimate a proportion and a confidence interval.
- \(\hat{P}=\frac{X}{n}\) where X = number of observations in category of interest.
The prop.test command is one option to estimate a confidence interval.
Sample standard deviation for proportion data: \(S = \sqrt{\hat{P}\times(1-\hat{P})}\)
Margin of Error: \(E = \frac{S}{\sqrt{n}}\times Z = \sqrt{\frac{\hat{P}\times(1-\hat{P})}{n}}\times Z\)
We will cover hypothesis tests more formally in coming lectures
- Today we tested hypotheses and drew conclusions based on estimated confidence intervals.
To submit an Engagement Question or Comment about material from Lecture 16: Submit it by midnight today (day of lecture).