Introduction

The Central Limit Theorem is a foundation of theoretical-based statistical inference. If certain assumptions are met, the sampling distribution of certain sample statistics is approximately normal, with mean equal to the population parameter and standard error inversely proportional to the square root of the sample size. In the notation below, the second argument is the standard deviation of the sampling distribution (the standard error). Specifically,

for a single mean \[\bar{x} \sim Normal\bigg(\mu, \frac{\sigma}{\sqrt{n}}\bigg),\]
and for a single proportion, \[\hat{p} \sim Normal\bigg(p, \sqrt{\frac{p(1-p)}{{n}}}\bigg).\]
These results assume a random sample of size at least 30. If sampling without replacement, the sample size must be less than 10% of the population size so that the observations are approximately independent. For categorical data, the sample must contain at least 10 successes and 10 failures.
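As a rough illustration, the two standard errors and the success-failure condition can be written as small helper functions. The names se_mean, se_prop, and success_failure_ok are hypothetical and chosen only for this sketch; they do not come from any package.

```r
# Hypothetical helpers for the formulas above (not part of any package).

# Standard error of the sample mean: sigma / sqrt(n)
se_mean <- function(sigma, n) sigma / sqrt(n)

# Standard error of the sample proportion: sqrt(p * (1 - p) / n)
se_prop <- function(p, n) sqrt(p * (1 - p) / n)

# Success-failure check: at least 10 expected successes and 10 expected failures
success_failure_ok <- function(p, n) n * p >= 10 && n * (1 - p) >= 10

se_mean(sigma = sqrt(3), n = 36)
se_prop(p = 0.5, n = 100)
success_failure_ok(p = 0.5, n = 100)
```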

To get started, load the packages tidyverse, infer, and gifski. Install any missing package with install.packages("package_name").

library(tidyverse)
library(infer)
library(gifski)

Central Limit Theorem

We will visualize the result of the Central Limit Theorem if we assume our population follows a Poisson distribution with \(\lambda=3\).

In probability theory and statistics, the Poisson distribution, named after French mathematician Siméon Denis Poisson, is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space if these events occur with a known constant rate and independently of the time since the last event. The Poisson distribution can also be used for the number of events in other specified intervals such as distance, area or volume.
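
For reference, the probability mass function of a Poisson random variable \(X\) with rate \(\lambda > 0\) is
\[P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \ldots,\]
with \(E(X) = \lambda\) and \(Var(X) = \lambda\). A Poisson population with \(\lambda = 3\) therefore has mean 3 and standard deviation \(\sqrt{3} \approx 1.73\).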

More details on the Poisson distribution can be found via the link in the References section.

Poisson population

Generate 1,000,000 Poisson random variables with \(\lambda=3\) and save them as a tibble object population.
Plot a histogram of the distribution of the population. Use expression(paste("Poisson distribution with ", lambda, " = 3", sep = "")) as your title. Comment on the shape, center, and spread of the distribution.
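
One possible way to approach the two steps above is sketched here. The column name x, the seed, and the choice of binwidth = 1 are assumptions made for illustration, not requirements of the exercise.

```r
library(tidyverse)

set.seed(2019)  # arbitrary seed, used only for reproducibility

# One million draws from a Poisson(3) population, stored in column x (assumed name)
population <- tibble(x = rpois(n = 1e6, lambda = 3))

# The values are integer counts, so binwidth = 1 gives one bar per value
pop_hist <- ggplot(population, aes(x = x)) +
  geom_histogram(binwidth = 1, color = "white") +
  labs(
    title = expression(paste("Poisson distribution with ", lambda, " = 3", sep = "")),
    x = "x",
    y = "Count"
  )

print(pop_hist)
```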

Compute the mean and standard deviation of population.
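
A minimal sketch of this computation, assuming the population is stored in a tibble with a column named x (an assumed name), is given below. The simulated values can be compared with the theoretical mean \(\lambda = 3\) and standard deviation \(\sqrt{\lambda}\).

```r
library(tidyverse)

set.seed(2019)  # arbitrary seed, used only for reproducibility
population <- tibble(x = rpois(n = 1e6, lambda = 3))

# Simulated population mean and standard deviation
pop_summary <- population %>%
  summarise(mu = mean(x), sigma = sd(x))
pop_summary

# Theoretical values for a Poisson(3) population
c(mu = 3, sigma = sqrt(3))
```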

Sampling distribution of \(\bar{x}\)

Take 1,000 random samples of size 36 from population. Compute the mean of each sample, and then use ggplot() to visualize the sampling distribution of \(\bar{x}\). Create a plot similar to what you see below.

Use the following plot parameters:
- bins = 20
- title = expression(paste("Sampling distribution: ", bar(x), sep = ""))
- caption = expression(paste("Population is Poisson with ", lambda, " = 3", sep = ""))
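
The sampling step and plot described above might look roughly like the following. It uses rep_sample_n() from infer to draw the repeated samples; the column names x and xbar, the seed, and the axis labels are assumptions made for this sketch.

```r
library(tidyverse)
library(infer)

set.seed(2019)  # arbitrary seed, used only for reproducibility
population <- tibble(x = rpois(n = 1e6, lambda = 3))

# 1,000 samples of size 36; rep_sample_n() adds a `replicate` column
sample_means <- population %>%
  rep_sample_n(size = 36, reps = 1000) %>%
  group_by(replicate) %>%
  summarise(xbar = mean(x))

# Histogram of the 1,000 sample means with the requested plot parameters
samp_plot <- ggplot(sample_means, aes(x = xbar)) +
  geom_histogram(bins = 20, color = "white") +
  labs(
    title = expression(paste("Sampling distribution: ", bar(x), sep = "")),
    caption = expression(paste("Population is Poisson with ", lambda, " = 3", sep = "")),
    x = expression(bar(x)),
    y = "Count"
  )

print(samp_plot)
```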

Compute the mean and standard deviation of the above sampling distribution.
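
As a quick check, the simulated mean and standard deviation can be compared with the values the Central Limit Theorem predicts, \(\mu = 3\) and \(\sigma / \sqrt{n} = \sqrt{3} / \sqrt{36}\). The sketch below uses base R's replicate() and sample() rather than infer, purely to keep it short; the variable names are assumptions.

```r
set.seed(2019)  # arbitrary seed, used only for reproducibility
population <- rpois(n = 1e6, lambda = 3)

n <- 36

# 1,000 sample means, each from a sample of size n drawn without replacement
xbar <- replicate(1000, mean(sample(population, size = n)))

# Simulated mean and standard deviation of the sampling distribution
c(mean = mean(xbar), sd = sd(xbar))

# Values predicted by the Central Limit Theorem
c(mean = 3, sd = sqrt(3) / sqrt(n))
```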

As the sample size gets larger, the normal approximation gets better. Let’s visualize this with a GIF! This is where the gifski package comes in.

To create your GIF, write a for loop that iterates over sample sizes from 5 to 80 in increments of 5. You can create this vector with seq(5, 80, 5). The body of the loop will be your code above, modified so that the sample size changes with each iteration. Because ggplot objects are not printed automatically inside a for loop, wrap the plot code in the loop body in print().

Utilize the following chunk options:
- animation.hook='gifski', interval=2 (set up your GIF and time between images)
- fig.width=6, fig.height=5 (adjust figure dimensions)
- cache=TRUE (save knitting time later)
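
The loop could be sketched as follows. This is one possible arrangement, not a definitive solution: the chunk label clt-gif, the column names, the seed, the plain-text title, and the fixed x-axis via coord_cartesian() are all assumptions added here so that frames are easy to compare.

```r
# Intended for an R Markdown chunk whose header looks something like
# {r clt-gif, animation.hook='gifski', interval=2, fig.width=6, fig.height=5, cache=TRUE}
# (the label clt-gif is a placeholder)
library(tidyverse)
library(infer)

set.seed(2019)  # arbitrary seed, used only for reproducibility
population <- tibble(x = rpois(n = 1e6, lambda = 3))

sample_sizes <- seq(5, 80, 5)

for (n in sample_sizes) {
  # 1,000 sample means for the current sample size
  sample_means <- population %>%
    rep_sample_n(size = n, reps = 1000) %>%
    group_by(replicate) %>%
    summarise(xbar = mean(x))

  p <- ggplot(sample_means, aes(x = xbar)) +
    geom_histogram(bins = 20, color = "white") +
    coord_cartesian(xlim = c(0, 6)) +  # same axis in every frame (an added choice)
    labs(
      title = paste("Sampling distribution of the sample mean, n =", n),
      caption = expression(paste("Population is Poisson with ", lambda, " = 3", sep = "")),
      x = expression(bar(x)),
      y = "Count"
    )

  # ggplot objects are not auto-printed inside a loop, so print explicitly
  print(p)
}
```

Each call to print() produces one frame, and the gifski animation hook combines the frames into the GIF when the document is knit.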

Comment on the shape, center, and spread of the distribution as the sample size increases.

Theoretical-based inference

This problem is from ica-03-12-19.

Data come from Gallup’s most recent survey of the country, conducted Sept. 27-Nov. 28, 2018, a few months before Juan Guaido was sworn in as interim president of Venezuela on Jan. 23. Guaido is the head of the country’s opposition and the president of the Venezuelan National Assembly. His assumption of the presidency was a direct challenge to Nicolas Maduro, who has presided over the country since former Venezuelan President Hugo Chavez died in 2013.

From a survey of 1,000 Venezuelan adults, only 53% reported having enough money for adequate shelter.

Use the theoretical results (confidence interval formula) to compute a 99% confidence interval for the proportion of all Venezuelan adults that have enough money for adequate shelter. You may assume all the necessary assumptions are satisfied.
Perform a hypothesis test to determine if the survey results provide convincing evidence that a majority of Venezuelan adults can afford adequate shelter. Use a 5% significance level.
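
A hedged sketch of both calculations, using only base R and the normal approximation from the Introduction, is shown below. It assumes the usual conventions for a single proportion: the interval uses the standard error based on \(\hat{p}\), and the test of \(H_0: p = 0.5\) against \(H_A: p > 0.5\) uses the standard error based on the null value. The variable names are assumptions.

```r
p_hat <- 0.53  # sample proportion from the survey
n <- 1000      # sample size

# 99% confidence interval: p_hat +/- z* x SE, with SE based on p_hat
z_star <- qnorm(0.995)
se_hat <- sqrt(p_hat * (1 - p_hat) / n)
ci <- p_hat + c(-1, 1) * z_star * se_hat
ci

# One-sided test of H0: p = 0.5 against HA: p > 0.5, with SE based on p0
p0 <- 0.5
se_null <- sqrt(p0 * (1 - p0) / n)
z <- (p_hat - p0) / se_null
p_value <- pnorm(z, lower.tail = FALSE)
c(z = z, p_value = p_value)

# Compare the p-value with the 5% significance level
p_value < 0.05
```

Under these assumptions the p-value is roughly 0.029, which is below the 5% significance level.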

Theoretical-based inference

Shawn Santo

March 19, 2019

References