Lecture 15 - Proportions, Percentages and Confidence Intervals

Penelope Pooler Eisenbies
MAS 261

2023-10-16

Housekeeping

Today’s plan 📋
- Comments and Questions about HW 5
- A few minutes for R Questions 🪄
- Review of Confidence Interval Concepts and Definitions
- Terminology for Proportion Estimates
- Point Estimates and Confidence Interval for Proportions
- Introduction to the concept of Hypotheses
- Examining Hypotheses using Confidence Intervals

Review: R and RStudio 🪄

Review: You have two options to facilitate your introduction to R and RStudio:
- Option 1: Create a Posit Cloud account and download and install R and RStudio on your laptop.
- Option 2: Start with a free Posit Cloud account and later transition to using R/RStudio on your laptop.
If you are comfortable with coding: Start with Option 1, but still sign up for a Posit Cloud account.
- We will use Posit Cloud for Quizzes.
If you are nervous about coding: Choose Option 2.
For both options: I can help with download/install issues during office hours.
What I do: I maintain a Posit Cloud account for helping students but I do most of my work on my laptop.
NOTE: We will use R and RStudio in class during MOST lectures
- You can use either Posit Cloud or your laptop.

💥 Lecture 15 In-class Exercises - Q1 (Review) 💥

In Lecture 14 and HW 5, we covered the three components of the margin of error, E, which is the half width of the confidence interval.

Recall

CI Lower Bound = \(\overline{X}-E\)
CI Upper Bound = \(\overline{X}+E\)
\(E = \frac{S}{\sqrt{n}}\times t\) where
- S is the sample standard deviation
- n is the sample size
- t is determined by the confidence level and the degrees of freedom (df = n-1)
Which component of E do we have no ability to control?
- S, n or t?
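To see the three components side by side, here is a minimal sketch that builds E by hand and compares the result with t.test. It uses the built-in mtcars data (miles per gallon) purely as a stand-in sample; that dataset and the variable names are assumptions for illustration, not course data.

```r
# Stand-in sample: miles per gallon from R's built-in mtcars data
# (an assumption for illustration, not course data)
x <- mtcars$mpg

n     <- length(x)              # sample size
s     <- sd(x)                  # sample standard deviation
t_val <- qt(0.975, df = n - 1)  # t for 95% confidence with df = n - 1

E <- s / sqrt(n) * t_val        # margin of error (half width of the CI)

by_hand <- c(mean(x) - E, mean(x) + E)      # lower and upper bounds
from_r  <- as.numeric(t.test(x)$conf.int)   # same interval from t.test

by_hand
from_r
```

Changing the 0.975 (for example to 0.995 for 99% confidence) changes t and therefore E, while s and n come from the sample itself.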

Questions from HW 5 and Quiz 1?

HW 5 is due Wednesday, 10/18, and the grace period is extended until 10/20 at midnight.
- BUT I will be out of town on 10/20.
- I leave early that morning and I will not have access to email.
- Sihang, the course TA, is available by email (swang189@syr.edu).
If there are questions from HW 5 or Quiz 1 that are general and would benefit everyone, please let me know.
- We can go over them today (end of lecture) or on Thursday.
- Questions can be about concepts, R skills, etc.

Estimating a Proportion

Proportion: a part of a whole population
Expressed as a value between 0 and 1
- Proportion x 100% = Percentage

Notation:
- Population proportion: P (Often unknown)
- Sample proportion: \(\hat{P}\) (from sample data)
- Confidence Interval for a Proportion
  - Lower Bound: \(\hat{P} - E\)
  - Upper Bound: \(\hat{P} + E\)
  - E is the margin of error (calculated a little differently than for quantitative data).

Example Data - Are US Drivers Ready to Go Fully Electric?

Example Data - Are US Drivers Still Skeptical of EVs?

These data were collected with a poll that had SIX response choices but we’ll use a common analytical technique to simplify the analyses.
We group the data into
- NO’s (Responded NOT Likely)
- Not NO’s (Did not respond NOT likely)
The pollsters interviewed 1025 Adults in the US (n = 1025)
- 584 of those interviewed responded ‘NOT Likely’
- 441 did not respond ‘NOT Likely’
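The grouped counts above can be checked and turned into a sample proportion and a percentage with a few lines of R. This is only a sketch; the variable names are chosen here for illustration.

```r
n        <- 1025       # adults interviewed
x_no     <- 584        # responded 'NOT Likely'
x_not_no <- n - x_no   # everyone else (the Not NO's)

p_hat <- x_no / n      # sample proportion of NO's

x_not_no               # should match the 441 reported above
round(p_hat, 4)        # proportion, between 0 and 1
round(100 * p_hat, 1)  # same value expressed as a percentage
```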

Are we 95% confident that the majority of US adults are not ready for EVs?
Are we 99% confident that the majority of US adults are not ready for EVs?

Calculating an Estimate and a Confidence Interval

Estimated proportion, \(\hat{P} = \frac{X}{n}=\frac{584}{1025} \approx 0.57\)
- X is the number of observations in category of interest and n = sample size
95% Confidence Interval
- correct = F is used because the sample size is large enough that a continuity correction is not needed.
- Students in MAS 261 are not required to learn when to use a continuity correction.
- All examples and questions will have sufficiently large sample sizes.

prop.test(584, 1025, conf.level=.95, correct = F)


    1-sample proportions test without continuity correction

data:  584 out of 1025, null probability 0.5
X-squared = 19.95, df = 1, p-value = 0.000007948
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.5392410 0.5997503
sample estimates:
        p 
0.5697561
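Instead of reading the bounds off the printed output, the result of prop.test can be saved and its pieces pulled out by name. A minimal sketch follows; the object name ev_95 is made up for this example, and correct = FALSE is the long form of correct = F.

```r
# Save the result instead of only printing it
ev_95 <- prop.test(584, 1025, conf.level = 0.95, correct = FALSE)

ev_95$estimate               # sample proportion (p-hat)
ev_95$conf.int               # lower and upper bounds of the interval

lower <- ev_95$conf.int[1]   # CI lower bound
upper <- ev_95$conf.int[2]   # CI upper bound
round(c(lower, upper), 3)
```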

💥 Lecture 15 In-class Exercises - Q2 and Q3 💥

Interpreting the prop.test output:

Question 2:
- The 95% lower bound for the true proportion of US adults that are currently ‘NOT Likely’ to buy an EV is ____.

Question 3:
- Fill in the blank: We are 95% confident that a majority of US adults are ____ to buy an EV at this time.
- Specify likely or not likely.

Would results change if we opt for a 99% Confidence Interval?

In Lecture 14 we discussed that if we want to be MORE confident that we have captured the true value, our interval will be wider.

Increased Confidence Level translates to WIDER Confidence Interval

prop.test(584, 1025, conf.level=.95, correct = F)


    1-sample proportions test without continuity correction

data:  584 out of 1025, null probability 0.5
X-squared = 19.95, df = 1, p-value = 0.000007948
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.5392410 0.5997503
sample estimates:
        p 
0.5697561

prop.test(584, 1025, conf.level=.99, correct = F)


    1-sample proportions test without continuity correction

data:  584 out of 1025, null probability 0.5
X-squared = 19.95, df = 1, p-value = 0.000007948
alternative hypothesis: true p is not equal to 0.5
99 percent confidence interval:
 0.529599 0.609016
sample estimates:
        p 
0.5697561

In this case, the conclusion based on the confidence interval does not change: the 99% interval is wider, but because the sample size is large, its lower bound (0.530) is still above 0.5.
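The comparison of the two intervals can also be done numerically. The sketch below stores both intervals and compares their widths and lower bounds; the object names are chosen here for illustration.

```r
ci_95 <- prop.test(584, 1025, conf.level = 0.95, correct = FALSE)$conf.int
ci_99 <- prop.test(584, 1025, conf.level = 0.99, correct = FALSE)$conf.int

# Width of each interval (upper bound minus lower bound)
width_95 <- diff(ci_95)
width_99 <- diff(ci_99)
c(width_95 = width_95, width_99 = width_99)

# Is each lower bound still above 0.5?
c(ci_95[1] > 0.5, ci_99[1] > 0.5)
```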

How is this margin of Error, E, estimated?

In this case, t is NOT used to find E, AND S is calculated differently
- The t distribution is not appropriate for categorical data, but by the Central Limit Theorem (CLT) we can use the Z distribution:
  - 80% CI: Z = 1.282
  - 90% CI: Z = 1.645
  - 95% CI: Z = 1.960
  - 99% CI: Z = 2.576
- These are categorical (binomial) data so \(S = \sqrt{\hat{P}\times(1-\hat{P})}\)
  - Recall that \(\hat{P} = \frac{X}{n}\)
Margin of Error for Proportion Data:
- \(E = \frac{S}{\sqrt{n}}\times Z = \sqrt{\frac{\hat{P}\times(1-\hat{P})}{n}}\times Z\)
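The Z values listed above can be reproduced with qnorm, and the margin of error formula can be wrapped in a small function. This is a sketch only: prop_margin is a hypothetical helper written for this example (not a built-in R function), and the counts in the usage line are made up to keep the arithmetic easy to follow.

```r
# Z values for two-sided confidence intervals
conf_levels <- c(0.80, 0.90, 0.95, 0.99)
z_values <- qnorm(1 - (1 - conf_levels) / 2)
round(z_values, 3)

# Hypothetical helper: margin of error for a proportion, E = S / sqrt(n) * Z
prop_margin <- function(x, n, conf_level = 0.95) {
  p_hat <- x / n                         # sample proportion
  s <- sqrt(p_hat * (1 - p_hat))         # standard deviation for 0/1 data
  z <- qnorm(1 - (1 - conf_level) / 2)   # Z for the chosen confidence level
  s / sqrt(n) * z
}

# Usage with made-up counts: 50 of 100 in the category of interest
prop_margin(50, 100, conf_level = 0.95)
```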

💥 Lecture 15 In-class Exercises - Q4 and Q5 💥

Question 4: What is the standard deviation for our EV poll data?

Recall that \(\hat{P} = \frac{584}{1025} \approx 0.57\)

Question 5: If we are calculating a 99% Confidence Interval for a proportion, what Z value should we use?

See previous slide or check the bottom of this t-table.

💥 Lecture 15 In-class Exercises - Q6 💥

What is the margin of error, E for the 99% Confidence Interval for the EV poll data?

\(E = \frac{S}{\sqrt{n}}\times Z = \sqrt{\frac{\hat{P}\times(1-\hat{P})}{n}}\times Z\)

Suggested strategy:
1. Divide answer to question 4 by \(\sqrt{n}=\sqrt{1025}\)
2. Multiply this ratio by Z = 2.576
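3. Check work by using prop.test output (\(E = \frac{UB-LB}{2}\)); the two values should agree to about three decimal places, because prop.test uses a slightly different interval formula.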

prop.test(584, 1025, conf.level=.99, correct = F)


    1-sample proportions test without continuity correction

data:  584 out of 1025, null probability 0.5
X-squared = 19.95, df = 1, p-value = 0.000007948
alternative hypothesis: true p is not equal to 0.5
99 percent confidence interval:
 0.529599 0.609016
sample estimates:
        p 
0.5697561
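The check in step 3 can be scripted as shown in this sketch. One caveat: with correct = F, prop.test builds its interval with the Wilson score method rather than the formula on this slide, so the two margins of error should agree closely but not exactly. The variable names are chosen here for illustration.

```r
x <- 584
n <- 1025
p_hat <- x / n

# Margin of error from the slide formula (99% confidence)
z <- qnorm(0.995)                                # about 2.576
E_formula <- sqrt(p_hat * (1 - p_hat) / n) * z

# Half width of the interval reported by prop.test
ci_99 <- prop.test(x, n, conf.level = 0.99, correct = FALSE)$conf.int
E_prop_test <- (ci_99[2] - ci_99[1]) / 2

round(c(formula = E_formula, prop_test = E_prop_test), 4)
```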

Introduction to Hypotheses

So far this course has dealt with describing data and estimating values.
Often we want to go beyond description and estimation
- Based on what we see in the data, we want to formulate and test hypotheses.
- Hypotheses can have different formats based on type of data and questions being asked.
- In the next lecture we’ll talk about the formal language of hypothesis testing.
- Today we’ll discuss some concepts that show we have already been informally testing hypotheses.
- When looking at data, it is natural to develop hypotheses based on what we notice.
- Testing hypotheses is a formal way of examining the data, graphically and numerically.
- When we test our hypotheses, we are asking “Do these data support the ideas I have developed about this population?”

One more Look at the EV Poll 95% CI

Hypothesis: US adults are evenly split (50-50) on whether or not to buy an EV.

Our data disprove this hypothesis IF the 95% confidence interval EXCLUDES 0.5.

prop.test(584, 1025, conf.level=.95, correct = F)


    1-sample proportions test without continuity correction

data:  584 out of 1025, null probability 0.5
X-squared = 19.95, df = 1, p-value = 0.000007948
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.5392410 0.5997503
sample estimates:
        p 
0.5697561

Both endpoints of our 95% interval (and of our 99% interval) are ABOVE 0.5, which disproves this hypothesis.
This conclusion matches the other parts of the output.
The p-value is the probability of seeing sample data at least this far from 0.5 if the specified hypothesis is true.
- This hypothesis is the Null Hypothesis that is the default for proportion tests.
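Both checks (does the interval exclude 0.5, and is the p-value small) can be written out explicitly. A minimal sketch follows; the 0.05 cutoff pairs with the 95% confidence level, and the object names are made up for this example.

```r
# p = 0.5 is the default null value; it is written out here for clarity
ev_95 <- prop.test(584, 1025, p = 0.5, conf.level = 0.95, correct = FALSE)
ci <- ev_95$conf.int

# TRUE if 0.5 falls outside the interval on either side
excludes_half <- 0.5 < ci[1] || 0.5 > ci[2]

# TRUE if the p-value is below 0.05
small_p <- ev_95$p.value < 0.05

c(excludes_half = excludes_half, small_p = small_p)
```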

A Previous Confidence Interval Example

Recall, the global mean number of subscribers is 21.89 million for the top 1000 YouTubers.

Is this random sample of 60 US YouTubers typical of the global top 1000?

Null Hypothesis: These 60 randomly sampled US YouTubers are not different from the larger global population of top YouTubers.

If data disprove this hypothesis:

95% confidence interval will NOT contain 21.89
p-value will be less than 0.05

library(readr)  # provides read_csv()
yt60 <- read_csv("data/YouTube_US_60.csv", show_col_types = FALSE)
t.test(yt60$subscribers_mil, mu = 21.89)


    One Sample t-test

data:  yt60$subscribers_mil
t = -0.55738, df = 59, p-value = 0.5794
alternative hypothesis: true mean is not equal to 21.89
95 percent confidence interval:
 17.65192 24.28141
sample estimates:
mean of x 
 20.96667

💥 Lecture 15 In-class Exercises - Q7 💥

Null Hypothesis: These 60 randomly sampled US YouTubers are not different from the larger global population of top YouTubers which has a population mean of 21.89 million.

Does the t.test confidence interval output disprove this hypothesis?

t.test(yt60$subscribers_mil, mu=21.89)


    One Sample t-test

data:  yt60$subscribers_mil
t = -0.55738, df = 59, p-value = 0.5794
alternative hypothesis: true mean is not equal to 21.89
95 percent confidence interval:
 17.65192 24.28141
sample estimates:
mean of x 
 20.96667
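The pieces of the t.test output fit together in the same way as the margin of error formula for a mean. The sketch below does not need the data file: it starts from the numbers printed in the output above and works backwards to the standard error, the t statistic and the p-value. The variable names are chosen here for illustration.

```r
# Values copied from the t.test output above
x_bar <- 20.96667   # sample mean
lower <- 17.65192   # CI lower bound
upper <- 24.28141   # CI upper bound
df    <- 59         # degrees of freedom (n - 1)
mu_0  <- 21.89      # hypothesized population mean

E  <- (upper - lower) / 2    # margin of error (half width)
se <- E / qt(0.975, df)      # standard error, S / sqrt(n)

t_stat  <- (x_bar - mu_0) / se           # compare with t in the output
p_value <- 2 * pt(-abs(t_stat), df)      # two-sided p-value

# Does the 95% interval contain the hypothesized mean?
contains_mu <- lower <= mu_0 && mu_0 <= upper

round(c(t = t_stat, p_value = p_value), 4)
contains_mu
```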

Key Points from Today

Categorical data can be simplified to two categories for analysis purposes
- Two-category data: estimate a proportion and a confidence interval.
  - \(\hat{P}=\frac{X}{n}\) where X = number of observations in category of interest.
- The prop.test command is one option for estimating the confidence interval
- Sample standard deviation for proportion data: \(S = \sqrt{\hat{P}\times(1-\hat{P})}\)
- Margin of Error: \(E = \frac{S}{\sqrt{n}}\times Z = \sqrt{\frac{\hat{P}\times(1-\hat{P})}{n}}\times Z\)
We will cover hypothesis tests more formally in coming lectures
- Today we tested hypotheses and drew conclusions based on estimated confidence intervals.

To submit an Engagement Question or Comment about material from Lecture 15: Submit by midnight today (day of lecture). Click on Link next to the ❓ under Lecture 15