Confidence Intervals & QQ plots

2024-03-17

Confidence Intervals

A confidence interval has three main components: a point estimate, a confidence level, and a margin of error.
The point estimate is the single number that serves as the best guess for a parameter.
The confidence level is the long-run proportion of intervals, constructed by the same method from repeated samples, that contain the parameter.
Lastly, the margin of error is the half-width of the interval; it reflects how much the point estimate is expected to vary from sample to sample.

Confidence Intervals contd. Latex Slide 1

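The point estimate of the population mean is the sample mean \(\bar{x}\)
The margin of error is calculated as \(E = z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\), where \(\sigma\) is the standard deviation and \(n\) is the sample size
The confidence interval is found by subtracting the margin of error from, and adding it to, the point estimate: \(\bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\)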
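
The formula above can be sketched as a small R helper. The function name `z_interval` and the example measurements are illustrative assumptions, not part of the slides, and the sample standard deviation stands in for \(\sigma\).

```r
# Hypothetical helper: z-based confidence interval for a mean.
# Assumes a large sample or a normally distributed population, and
# uses the sample standard deviation as a stand-in for sigma.
z_interval <- function(x, conf_level = 0.95) {
  alpha <- 1 - conf_level
  z <- qnorm(1 - alpha / 2)               # critical value z_{alpha/2}
  margin <- z * sd(x) / sqrt(length(x))   # margin of error E
  c(lower = mean(x) - margin, upper = mean(x) + margin)
}

# Made-up measurements, assumed to come from a normal population
x <- c(12.1, 11.8, 12.6, 12.0, 11.5, 12.3, 12.2, 11.9)
ci <- z_interval(x)
ci
```

The interval is symmetric around the sample mean, so its midpoint always equals `mean(x)`.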

Dataset `Orange`

Let’s apply this concept to the Orange data set, which contains 35 observations. The confidence interval formula is appropriate for large sample sizes or when the population is normally distributed.

data(Orange)
head(Orange)

##   Tree  age circumference
## 1    1  118            30
## 2    1  484            58
## 3    1  664            87
## 4    1 1004           115
## 5    1 1231           120
## 6    1 1372           142

Plotly

The following plot shows the circumference variable against the age of the trees, grouped by tree (the Tree factor identifies the individual tree, not a tree type), along with the frequency.
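
The plotting code for this slide is not shown, so the following is only a sketch of a comparable figure, assuming the plotly package is installed. It covers circumference against age, colored by tree; the variable name `fig` and the choice of lines with markers are assumptions.

```r
library(plotly)
data(Orange)

# Assumed layout: one colored trace per tree, age on the x-axis
fig <- plot_ly(
  data = Orange,
  x = ~age,
  y = ~circumference,
  color = ~Tree,
  type = "scatter",
  mode = "lines+markers"
)
fig
```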

Point Estimate & Margin of Error - Latex Slide 2

pointest <- mean(Orange$circumference)
pointest

## [1] 115.8571

z_score <- qnorm(p = 0.025, lower.tail = FALSE) 
ME <- z_score * (sd(Orange$circumference)/
                   sqrt(length(Orange$circumference)))
ME

## [1] 19.04551

\(E = z_{\alpha/2}\frac{\sigma}{\sqrt{n}} = z_{0.05/2}\frac{\text{standard deviation}}{\sqrt{\text{sample size}}} \approx 1.96 \times \frac{57.49}{\sqrt{35}} \approx 19.05\)

Point Estimate & Margin of Error - Latex Slide 3 - Explanation through R Statistical Methods

The point estimate for the population mean is the sample mean.
The margin of error was calculated by first computing the critical z-value: \(z_{\alpha/{2}}\)
For a 95% confidence interval, \(\alpha = 0.05\) is the total area outside the interval, split equally between the two tails. The qnorm call with lower.tail = FALSE therefore uses p = 0.025, the 2.5% of the area in the right tail, and returns a critical value of about 1.96. Because the population standard deviation \(\sigma\) is unknown, the sample standard deviation from sd() is used in its place.
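
As a quick check of the critical value, the snippet below shows three equivalent ways to obtain \(z_{\alpha/2}\) with qnorm and confirms that the central area between \(-z\) and \(z\) is 95%. The variable names are illustrative.

```r
alpha <- 0.05

# Three equivalent ways to get the upper-tail critical value
z_upper <- qnorm(alpha / 2, lower.tail = FALSE)  # area alpha/2 to the right
z_alt   <- qnorm(1 - alpha / 2)                  # area 1 - alpha/2 to the left
z_neg   <- -qnorm(alpha / 2)                     # symmetry of the normal curve

c(z_upper, z_alt, z_neg)   # all approximately 1.96

# Area between -z and z under the standard normal curve
central_area <- pnorm(z_upper) - pnorm(-z_upper)
central_area
```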

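In certain situations, the sample size is not given, for example when planning a study. We can then solve the margin of error formula for the sample size, using the z-score, the standard deviation, and the desired margin of error: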

\(n = \frac{(z_{\alpha/{2}})^2\sigma^2}{E^2}\)
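
As a sketch of how this formula might be used, the snippet below estimates the sample size needed for a target margin of error. The target of 10 is a hypothetical value chosen for illustration, and the sample standard deviation of the Orange circumferences stands in for \(\sigma\).

```r
data(Orange)

target_E <- 10                    # hypothetical target margin of error
z <- qnorm(0.975)                 # critical value for 95% confidence
s <- sd(Orange$circumference)     # sample sd used as a stand-in for sigma

# Round up, since a fractional observation is not possible
n_required <- ceiling((z * s / target_E)^2)
n_required

# Margin of error achieved with that sample size
achieved_E <- z * s / sqrt(n_required)
achieved_E
```

Rounding up guarantees that the achieved margin of error does not exceed the target.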

Confidence Interval - Contd.

Now we can calculate the lower and upper bounds of the confidence interval:

P1 <- pointest - ME
P2 <- pointest + ME


confidence_interval <- c(P1, P2)
confidence_interval

## [1]  96.81163 134.90265

We can be 95% confident that the true mean circumference of orange trees in the industry lies between 96.81 and 134.90. The 95% describes the method rather than this one interval: about 95% of intervals constructed this way would contain the true mean.
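
One way to see what the 95% confidence level means is a small simulation: draw many samples from a population with a known mean, build an interval from each, and count how often the interval contains that mean. The population values below (mean 100, standard deviation 15) are made up for illustration; the sample size of 35 matches the Orange data.

```r
set.seed(1)

true_mean <- 100   # hypothetical population mean
true_sd   <- 15    # hypothetical population standard deviation
n         <- 35    # same sample size as the Orange data
reps      <- 2000  # number of simulated samples
z         <- qnorm(0.975)

# For each simulated sample, check whether the interval covers true_mean
covered <- replicate(reps, {
  x <- rnorm(n, mean = true_mean, sd = true_sd)
  margin <- z * sd(x) / sqrt(n)
  (mean(x) - margin <= true_mean) && (true_mean <= mean(x) + margin)
})

coverage <- mean(covered)
coverage
```

The coverage should land close to 0.95. It tends to fall slightly below, because the sample standard deviation is used in place of the known \(\sigma\).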

GGPlot Slide 1

Let’s plot the frequency distribution of the circumference variable.
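
The plotting code for this slide is not shown, so the following is a minimal sketch of such a histogram with ggplot2. The bin width of 20, the labels, and the variable name `p` are assumptions.

```r
library(ggplot2)
data(Orange)

# Histogram of circumference; binwidth = 20 is an assumed choice
p <- ggplot(Orange, aes(x = circumference)) +
  geom_histogram(binwidth = 20, color = "white") +
  ggtitle("Distribution of circumference") +
  xlab("circumference") + ylab("count")
p
```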

Introduction to QQ Plot

The data do not appear to be normally distributed. To compare the dataset with a normal distribution, let’s use a QQ plot. A QQ plot compares the quantiles of a dataset with the theoretical quantiles of a normal distribution.

The deviation of the Orange data points from the QQ line shows how closely the dataset follows a normal distribution: the closer the points lie to the line, the better the fit.
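
To make the comparison concrete, the points of a normal QQ plot can be computed by hand in base R: the sorted sample values are paired with normal quantiles at evenly spaced probabilities. This is a sketch with illustrative variable names; the reference line passes through the first and third quartiles, as base R's qqline() does by default.

```r
data(Orange)
x <- Orange$circumference
n <- length(x)

sample_q      <- sort(x)             # sample quantiles
theoretical_q <- qnorm(ppoints(n))   # matching standard normal quantiles

head(data.frame(theoretical_q, sample_q))

# Reference line through the first and third quartiles
slope <- unname(diff(quantile(x, c(0.25, 0.75))) / diff(qnorm(c(0.25, 0.75))))
intercept <- unname(quantile(x, 0.25)) - slope * qnorm(0.25)
c(intercept = intercept, slope = slope)
```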

GGPlot 2 - QQPlot

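library(ggplot2)  # provides ggplot(), stat_qq() and stat_qq_line()
# Sample quantiles of circumference against theoretical normal quantiles
ggplot(Orange, aes(sample = circumference)) +
  stat_qq() + stat_qq_line(color = "red") + ggtitle("QQ Plot") +
  xlab("Theoretical quantiles") + ylab("Sample quantiles")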