Part I: Estimating probabilities

load the mosaic package first:

library(mosaic)

Chile referendum data

In this part we will use the dataset Chile.

Chile <- read.table("http://asta.math.aau.dk/dan/static/datasets?file=Chile.dat", header=TRUE, quote="\"")

Do a cross tabulation of the variables vote and sex.

tally <- xtabs(~vote+sex, data = Chile)
tally

##     sex
## vote   F   M
##    A 104  83
##    N 363 526
##    U 362 226
##    Y 480 388

Estimate the probability of vote=N.

Our estimate is 35.11%

Make a 95% confidence interval for the probability of vote=N.

The 95 percent confidence interval is between: 0.3325575 and 0.3701101

p = 0.3511058

Since p is inside our confidence interval, our hypothesis holds. ????

Estimate the probability of vote=N given that sex=F.

The estimated probability is 14.33%

What would these probabilities satisfy if vote and sex were statistically independent?

If vote is not dependendent on sex, that means we can filter by different sexes and we will still get the same results of a percentage of votes for different candidates.

In our case these variables are dependent on each other, because the percentage of women voting N is 14.33%, but if we take men, then percentage is 20.77%.

Part II: Sampling distributions and the central limit theorem

This is a purely theoretical exercise where we investigate the random distribution of samples from a known population.

Waiting times in a queue

We start by sampling data from the so-called exponential distribution - also called the negative exponential distribution. The exponential distribution is the most common distribution used to describe the waiting time between arrivals in a queue. It has one parameter, which is the number of arrivals per time unit, also called the arrival rate. In our case we set it to 1 arrival per time unit. Since the arrival rate in our theoretical population is 1, the mean waiting time for the population will be \(\mu=1\). Furthermore, it can be shown that the standard deviation is \(\sigma=1\).

The following commands randomly samples 25 waiting times y and calculates the mean of these y_bar.

y <- rexp(25, rate = 1)
y

##  [1] 1.35549680 0.03830973 0.13247754 0.07264786 0.16162274 0.68402546
##  [7] 0.70456021 0.55635454 1.73838975 0.17380264 0.01836751 1.66725620
## [13] 1.09646085 0.03353483 0.21166849 0.79651158 0.13646549 0.65812966
## [19] 0.60010534 2.28095025 1.82999858 0.27017926 0.34309223 0.52959207
## [25] 0.12904653

y_bar <- mean(y)
y_bar

## [1] 0.6487618

Note: Since it is a random sample from the population your numbers will be different. Try to rerun the commands a few times.

The following command replicates the sampling experiment 1000 times and saves the result as a matrix(y) with 25 rows and 1000 columns:

y <- replicate(1000, rexp(25, rate = 1))

The mean(y_bar) is calculated for each of the 1000 replications (i.e. each entry in y_bar is the average of the 25 values in the corresponding column):

y_bar <- colMeans(y)

Make a histogram of all the sampled waiting times using a command like histogram(as.numeric(y), breaks = 40) inserted in a new code chunk (try to do experiments with the number of breaks):

histogram(as.numeric(y), breaks = 30)

histogram(as.numeric(y), breaks = 200)

- Explain how a histogram is constructed. #svar X-axis shows how big percentage each group is. Y-axis shows the values of the different groups. Breakpoints decides the size of each column, by giving how many columns are presented in the histogram.

Does this histogram look like a normal distribution?

svar

No it does not look like a normal distribution, because we start with a big value and then falls. It isn’t a bell curve, so it is not a normal distribution.

Now we focus on the mean waiting times y_bar.

Based on the known population parameters \(\mu = 1\) and \(\sigma = 1\) what is the the mean, standard deviation and approximate distribution of y_bar according to the CLT?

hist(y[1,], col="red", main="Histogram of 1st Sample", xlab="Observation")

# histogram of averages - include prob
hist(y_bar, 
     breaks=10, 
     probability=TRUE,
     main="Distribution of Sample Means",
     xlab="Sample Mean")

# add vertical lines for mean of the means and the theoretical means
abline(v=mean(y_bar), col="red", lwd=4)
abline(v=1/1, col="blue", lty=22, lwd=4)

## add probablility density curves for the sample and theoritical distribution
sampleDensity <- density(y_bar)
lines(sampleDensity, lwd=4, col="red")

# theoritical density
mu = 1
sigma = mu / sqrt(25)
xvalues <- seq(min(y_bar), max(y_bar), length=100)
lines(xvalues, dnorm(xvalues, mean=mu, sd=sigma), col="blue", lwd="4", lty=2)

# add legend
legend('topright', c("simulation", "theoretical"), lty=c(1,2), lwd=2, col=c("red", "blue"))

comparisonTable <- matrix(c(mean(y_bar), 
            mu,
            sd(y_bar),
            sigma,
            var(y_bar),
            sigma^2), 2)

colnames(comparisonTable) <- c("Mean","Standard Deviation", "Variance")
rownames(comparisonTable) <- c("Sample Means", "Theoretical")

What are the theoretical quartiles based on this approximate distribution of y_bar?

quantile(y_bar)

##        0%       25%       50%       75%      100% 
## 0.4995847 0.8636440 0.9882406 1.1212914 1.7458086

Compare the predicted values of mean, standard deviation and quartiles with the observed values (you can use favstats to calculate these from y_bar).

comparisonTable

##                  Mean Standard Deviation   Variance
## Sample Means 0.999303          0.1957406 0.03831438
## Theoretical  1.000000          0.2000000 0.04000000

Make a histogram of the sample means (y_bar). Does it look like a normal distribution?

hist(y_bar)

Svar

This looks like a normal distribution, because it has the bell shape.

Make a boxplot of the sample means and explain how a boxplot is constructed.

boxplot(y_bar)

Part III: Theoretical boxplot for a normal distribution

Finally, consider the theoretical boxplot of a general normal distribution with mean \(\mu\) and standard deviation \(\sigma\), and find the probability of being an outlier according to the 1.5\(\cdot\)IQR criterion:

First find the \(z\)-score of the lower/upper quartile. I.e. the value of \(z\) such that \(\mu \pm z\sigma\) is the lower/upper quartile.

FIND NAME FOR QNORM

qnorm(0.25)

## [1] -0.6744898

qnorm(0.75)

## [1] 0.6744898

Use this to find the IQR (expressed in terms of \(\sigma\)).

qnorm(0.75) - qnorm(0.25)

## [1] 1.34898

In order to find the IQR, we need to substract quartile 1 from quartile 3 and multiply with ????. Then we get the median where the most values are.

Now find the \(z\)-score of the maximal extent of the whisker. I.e. the value of \(z\) such that \(\mu \pm z\sigma\) is the endpoint of lower/upper whisker.

Distance for maximum and minimum extent of the whisker is the middle region (the most common 50% of measurements) multiplied by 1.5. All the points that are outside of this region will be drawn as normal points and will not be included in boxplot.

Find the probability of being an outlier.

Take all the points that are outside the whisker and divide that amount by the whole amount of points. This is going to be a small probablility.