MAS 261 - Lecture 16

Proportions, Percentages and Confidence Intervals

Author

Penelope Pooler Eisenbies

Published

October 16, 2024

Housekeeping

Comments and Questions about HW 5
A few minutes for R Questions 🪄
Review of Confidence Interval Concepts and Definitions
Terminology for Proportion Estimates
Point Estimates and Confidence Interval for Proportions
Introduction to the concept of Hypotheses
Examining Hypotheses using Confidence Intervals

R and RStudio

In this course we will use R and RStudio to understand statistical concepts.
You will access R and RStudio through Posit Cloud.
- Sign up for a Free Posit Cloud Account
I will post R/RStudio files on Posit Cloud that you can access in provided links.
I will also provide demo videos that show how to access files and complete exercises.
NOTE: The free Posit Cloud account is limited to 25 hours per month.
- I demo how to download completed work so that you can use this allotment efficiently.
- For those who want to go further with R/RStudio:
  - I have added a new page to the MAS 261 website, Installing R and RStudio

Lecture 16 In-class Exercises - Q1

In Lecture 14 and HW 5, we covered the three components of the margin of error, E, which is the half width of the confidence interval.

Recall

CI Lower Bound = $\overline{X}-E$
CI Upper Bound = $\overline{X}+E$
$E = \frac{S}{\sqrt{n}}\times t$ where
- S is the sample standard deviation
- n is the sample size
- t is determined by the confidence level and the degrees of freedom (df = n-1)
Which component of E do we have no ability to control?
- S, n or t?
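
As a refresher, the formula above can be checked in R. The sketch below uses a small made-up sample (the values in `x` are hypothetical, not course data) and compares the hand calculation with the interval that `t.test` reports.

```{r echo=T}
# hypothetical sample, for illustration only (not course data)
x <- c(12, 15, 9, 20, 18, 11, 14, 16)

n <- length(x)                   # sample size
s <- sd(x)                       # sample standard deviation, S
t_crit <- qt(0.975, df = n - 1)  # t for 95% confidence, df = n - 1

E <- s / sqrt(n) * t_crit        # margin of error (half width of the CI)
hand_ci <- c(mean(x) - E, mean(x) + E)

hand_ci
t.test(x, conf.level = 0.95)$conf.int  # should match hand_ci
```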

Questions from HW 5 and Quiz 1

HW 5 is due Friday, 10/18 and the grace period is extended until 10/21 at midnight.
- I do not have office hours on Friday (10/18)
- Sinuo, the course TA, does have office hours on Friday
If there are questions from HW 5 or Quiz 1 that are general and would benefit everyone, please let me know.
- Questions can be about concepts, R skills, etc.

Estimating a Proportion

Proportion: a part of a whole population
Expressed as a value between 0 and 1
- Proportion × 100% = Percentage

Notation:
- Population proportion: P (Often unknown)
- Sample proportion: $\hat{P}$ (from sample data)
- Confidence Interval for a Proportion
  - Lower Bound: $\hat{P} - E$
  - Upper Bound: $\hat{P} + E$
  - E is the margin of error (calculated a little differently than for quantitative data).

Are US Drivers Ready to Go Fully Electric?

Are US Drivers Still Skeptical of EVs?

These data were collected with a poll that had SIX response choices but we’ll use a common analytical technique to simplify the analyses.
We group the data into
- NO’s (Responded NOT Likely)
- Not NO’s (Did not respond NOT likely)
The pollsters interviewed 1025 Adults in the US (n = 1025)
- 584 of those interviewed responded ‘NOT Likely’
- 441 did not respond ‘NOT Likely’

Are we 95% confident that the majority of US adults are not ready for EVs?
Are we 99% confident that the majority of US adults are not ready for EVs?

Calculating an Estimate and a Confidence Interval

Estimated proportion, $\hat{P} = \frac{X}{n}=\frac{584}{1025} = 0.57$
- X is the number of observations in category of interest and n = sample size
95% Confidence Interval
- correct = F is used because the sample size is large enough that a continuity correction is not needed.
- Students in MAS 261 are not required to learn about when to use a continuity correction.
- All examples and questions will have sufficiently large sample sizes.

```{r echo=T}
prop.test(584, 1025, conf.level=.95, correct = F)
```


    1-sample proportions test without continuity correction

data:  584 out of 1025, null probability 0.5
X-squared = 19.95, df = 1, p-value = 0.000007948
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.5392410 0.5997503
sample estimates:
        p 
0.5697561
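
The pieces of this output can also be pulled out individually, which helps when reading off the bounds. A minimal sketch, assuming the result is saved under the name `ev_95` (a name chosen here for illustration); `FALSE` is the spelled-out form of `F`.

```{r echo=T}
x <- 584    # number who responded 'NOT Likely'
n <- 1025   # sample size

p_hat <- x / n   # point estimate by hand
p_hat

# save the test result (ev_95 is an illustrative name) and extract parts
ev_95 <- prop.test(x, n, conf.level = 0.95, correct = FALSE)
ev_95$estimate   # sample proportion
ev_95$conf.int   # lower and upper bounds of the 95% interval
p_hat * 100      # the estimate expressed as a percentage
```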

Lecture 16 In-class Exercises - Q2 and Q3

Interpreting the prop.test output:

Question 2:
- The 95% lower bound for the true proportion of US adults that are currently ‘NOT Likely’ to buy an EV is ____.

Question 3:
- Fill in the blank: We are 95% confident that a majority of US adults are ____ to buy an EV at this time.
- Specify likely or not likely.

Would results change if we opt for a 99% Confidence Interval?

In Lecture 14 we discussed that if we want to be MORE confident that we have captured the true value, our interval will be wider.

Increased Confidence Level translates to WIDER Confidence Interval

```{r echo=T}
prop.test(584, 1025, conf.level=.95, correct = F)
```


    1-sample proportions test without continuity correction

data:  584 out of 1025, null probability 0.5
X-squared = 19.95, df = 1, p-value = 0.000007948
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.5392410 0.5997503
sample estimates:
        p 
0.5697561

```{r echo=T}
prop.test(584, 1025, conf.level=.99, correct = F)
```


    1-sample proportions test without continuity correction

data:  584 out of 1025, null probability 0.5
X-squared = 19.95, df = 1, p-value = 0.000007948
alternative hypothesis: true p is not equal to 0.5
99 percent confidence interval:
 0.529599 0.609016
sample estimates:
        p 
0.5697561

In this case, the conclusion based on the confidence interval does not change: both intervals lie entirely above 0.5, in part because the large sample size keeps them narrow.
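
To see the widening directly, the widths of the two intervals can be compared side by side. This sketch reuses the poll counts; the helper name `ci_for` is an assumption made for illustration.

```{r echo=T}
# helper (illustrative name): interval for the EV poll at a given level
ci_for <- function(level) {
  as.numeric(prop.test(584, 1025, conf.level = level, correct = FALSE)$conf.int)
}

ci_95 <- ci_for(0.95)
ci_99 <- ci_for(0.99)

diff(ci_95)   # width of the 95% interval
diff(ci_99)   # width of the 99% interval (wider)

# does each interval sit entirely above 0.5?
c(ci_95[1] > 0.5, ci_99[1] > 0.5)
```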

How is this margin of Error, E, estimated?

In this case, t is NOT used to find E AND S is calculated differently
- The t distribution is not appropriate for categorical data, but by the Central Limit Theorem (CLT) we can use the Z distribution:
  - 80% CI: Z = 1.282
  - 90% CI: Z = 1.645
  - 95% CI: Z = 1.960
  - 99% CI: Z = 2.576
- These are categorical (binomial) data so $S = \sqrt{\hat{P}\times(1-\hat{P})}$
  - Recall that $\hat{P} = \frac{X}{n}$
Margin of Error for Proportion Data:
- $E = \frac{S}{\sqrt{n}}\times Z = \sqrt{\frac{\hat{P}\times(1-\hat{P})}{n}}\times Z$
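
The Z values listed above come from the standard normal distribution and can be reproduced with `qnorm`. The sketch below also wraps the margin of error formula in a small function; the name `margin_of_error` is hypothetical and is not part of base R.

```{r echo=T}
# Z values for common confidence levels
conf_levels <- c(0.80, 0.90, 0.95, 0.99)
z_values <- qnorm(1 - (1 - conf_levels) / 2)
round(z_values, 3)

# margin of error for a proportion (hypothetical helper, not a built-in)
margin_of_error <- function(x, n, conf_level) {
  p_hat <- x / n
  s <- sqrt(p_hat * (1 - p_hat))        # S for two-category data
  z <- qnorm(1 - (1 - conf_level) / 2)  # Z for the confidence level
  s / sqrt(n) * z
}

margin_of_error(584, 1025, 0.95)   # E for the 95% interval of the EV poll
```

The expression `1 - (1 - conf_level) / 2` splits the leftover probability evenly between the two tails, which is why a 95% interval uses the 0.975 quantile.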

Lecture 16 In-class Exercises - Q4 and Q5

Question 4: What is the standard deviation for our EV poll data?

Recall that $\hat{P} = \frac{584}{1025} = 0.57$

Question 5: If we are calculating a 99% Confidence Interval for a proportion, what Z value should we use?

See previous slide or check the bottom of this t-table.

Lecture 16 In-class Exercises - Q6

What is the margin of error, E for the 99% Confidence Interval for the EV poll data?

$E = \frac{S}{\sqrt{n}}\times Z = \sqrt{\frac{\hat{P}\times(1-\hat{P})}{n}}\times Z$

Suggested strategy:
1. Divide answer to question 4 by $\sqrt{n}=\sqrt{1025}$
2. Multiply this ratio by Z
3. Check work by using prop.test output ($E = \frac{UB-LB}{2}$)

```{r echo=T}
prop.test(584, 1025, conf.level=.99, correct = F)
```


    1-sample proportions test without continuity correction

data:  584 out of 1025, null probability 0.5
X-squared = 19.95, df = 1, p-value = 0.000007948
alternative hypothesis: true p is not equal to 0.5
99 percent confidence interval:
 0.529599 0.609016
sample estimates:
        p 
0.5697561
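
One way to carry out step 3 is sketched below. Note that `prop.test` builds its interval with the Wilson score method rather than directly from $\hat{P} \pm E$, so the hand-calculated E and the half width from the output should agree closely but not to every decimal place. The variable names are illustrative.

```{r echo=T}
p_hat <- 584 / 1025
n <- 1025

# hand calculation for 99% confidence
s <- sqrt(p_hat * (1 - p_hat))
E_hand <- s / sqrt(n) * 2.576

# half width taken from the prop.test interval: (UB - LB) / 2
ci_99 <- prop.test(584, 1025, conf.level = 0.99, correct = FALSE)$conf.int
E_output <- (ci_99[2] - ci_99[1]) / 2

round(c(E_hand = E_hand, E_output = E_output), 4)
```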

Introduction to Hypotheses

So far this course has dealt with describing data and estimating values.
Often we want to go beyond description and estimation
- Based on what we see in the data, we want to formulate and test hypotheses.
- Hypotheses can have different formats based on type of data and questions being asked.
- In the next lecture we’ll talk about the formal language of hypothesis testing.
- Today we’ll discuss some concepts that show we have already been informally testing hypotheses.
- When looking at data, it is natural to develop hypotheses based on what we notice.
- Testing hypotheses is a formal way of examining the data, graphically and numerically.
- When we test our hypotheses, we are asking “Do these data support the ideas I have developed about this population?”

One more Look at the EV Poll 95% CI

Hypothesis: US adults are evenly split (50-50) on whether or not to buy an EV.

Our data disprove this hypothesis IF the 95% confidence interval EXCLUDES 0.5.

```{r echo=T}
prop.test(584, 1025, conf.level=.95, correct = F)
```


    1-sample proportions test without continuity correction

data:  584 out of 1025, null probability 0.5
X-squared = 19.95, df = 1, p-value = 0.000007948
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.5392410 0.5997503
sample estimates:
        p 
0.5697561

Both endpoints of our 95% interval (and of our 99% interval) are ABOVE 0.5, which disproves this hypothesis.
This conclusion matches the other parts of the output.
The P-value is the probability of seeing sample data at least as extreme as these if the specified hypothesis is true.
- This hypothesis (P = 0.5) is the Null Hypothesis that prop.test uses by default.
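
For readers curious where the other parts of the output come from, the sketch below rebuilds the test statistic and p-value by hand under the Null Hypothesis P = 0.5. This is a look ahead rather than required material, and the variable names are illustrative.

```{r echo=T}
x <- 584
n <- 1025
p_hat <- x / n
p_null <- 0.5   # value specified by the Null Hypothesis

# standard error computed under the Null Hypothesis
se_null <- sqrt(p_null * (1 - p_null) / n)
z <- (p_hat - p_null) / se_null

z^2                  # matches X-squared in the prop.test output
2 * pnorm(-abs(z))   # two-sided p-value

# compare with the values stored in the prop.test result
res <- prop.test(x, n, p = p_null, conf.level = 0.95, correct = FALSE)
res$statistic
res$p.value
```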

A Previous Confidence Interval Example

Recall, the global mean number of subscribers is 21.89 million for the top 1000 YouTubers.

Is this random sample of 60 US YouTubers typical of the global top 1000?

Null Hypothesis: These 60 randomly sampled US YouTubers are not different from the larger global population of top YouTubers.

If data disprove this hypothesis:

The 95% confidence interval will NOT contain 21.89
The p-value will be less than 0.05

```{r echo=T}
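library(readr)  # provides read_csv()
yt60 <- read_csv("data/YouTube_US_60.csv", show_col_types = FALSE)
# one-sample t-test of the sample mean against the global mean of 21.89
t.test(yt60$subscribers_mil, mu=21.89)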
```


    One Sample t-test

data:  yt60$subscribers_mil
t = -0.55738, df = 59, p-value = 0.5794
alternative hypothesis: true mean is not equal to 21.89
95 percent confidence interval:
 17.65192 24.28141
sample estimates:
mean of x 
 20.96667
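
Both checks can be read off the output above. The sketch below uses only the numbers printed by `t.test` (so it runs without the data file) and also recovers the t statistic from the interval; the variable names are illustrative.

```{r echo=T}
# values copied from the t.test output above
ci <- c(17.65192, 24.28141)   # 95% confidence interval
x_bar <- 20.96667             # sample mean
mu_0 <- 21.89                 # global mean under the Null Hypothesis
df <- 59                      # degrees of freedom, n - 1

# check 1: is 21.89 inside the interval?
mu_0 >= ci[1] & mu_0 <= ci[2]

# recover the standard error from the margin of error, E = (UB - LB) / 2
E <- (ci[2] - ci[1]) / 2
se <- E / qt(0.975, df)

# check 2: rebuild the t statistic and two-sided p-value
t_stat <- (x_bar - mu_0) / se
p_value <- 2 * pt(-abs(t_stat), df)
c(t = t_stat, p_value = p_value)
```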

Lecture 16 In-class Exercises - Q7

Null Hypothesis: These 60 randomly sampled US YouTubers are not different from the larger global population of top YouTubers, which has a population mean of 21.89 million subscribers.

Does the t.test confidence interval output disprove this hypothesis?

```{r echo=T}
t.test(yt60$subscribers_mil, mu=21.89)
```


    One Sample t-test

data:  yt60$subscribers_mil
t = -0.55738, df = 59, p-value = 0.5794
alternative hypothesis: true mean is not equal to 21.89
95 percent confidence interval:
 17.65192 24.28141
sample estimates:
mean of x 
 20.96667

Key Points from Today

Categorical Data can be simplified to two categories for analysis purposes
- Two category data - Estimate a proportion and a confidence interval.
  - $\hat{P}=\frac{X}{n}$ where X = number of observations in category of interest.
- The prop.test command is one option for estimating a confidence interval
- Sample standard deviation for proportion data: $S = \sqrt{\hat{P}\times(1-\hat{P})}$
- Margin of Error: $E = \frac{S}{\sqrt{n}}\times Z = \sqrt{\frac{\hat{P}\times(1-\hat{P})}{n}}\times Z$
We will cover hypothesis tests more formally in coming lectures
- Today we tested hypotheses and drew conclusions based on estimated confidence intervals.

To submit an Engagement Question or Comment about material from Lecture 16: Submit it by midnight today (day of lecture).
