2024-03-17

Confidence Intervals

  • A confidence interval has three main components: the point estimate, the confidence level, and the margin of error.
  • The point estimate is the single number that serves as the best guess for a parameter.
  • The confidence level is the long-run proportion of intervals, built by the same method from repeated samples, that contain the true parameter.
  • The margin of error is the half-width of the interval; it measures how far the point estimate may plausibly be from the parameter because of sampling error.
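
The meaning of the confidence level can be illustrated with a small simulation. This is a minimal sketch, not part of the original slides: the population mean, standard deviation, sample size, seed, and number of repetitions are all hypothetical values chosen for illustration.

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
mu <- 50; sigma <- 10; n <- 40   # hypothetical population and sample size
z <- qnorm(0.975)                # critical value for 95% confidence

# Draw 2000 samples and check whether each interval contains mu
covered <- replicate(2000, {
  x <- rnorm(n, mean = mu, sd = sigma)
  me <- z * sigma / sqrt(n)      # margin of error with known sigma
  (mean(x) - me <= mu) && (mu <= mean(x) + me)
})

coverage <- mean(covered)        # share of intervals that contain mu
coverage
```

The share of intervals containing the true mean should land close to 0.95, which is what a 95% confidence level describes.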

Confidence Intervals contd. Latex Slide 1

  • Point Estimate can be represented by \(\bar{x}\)
  • Margin of Error can be calculated with \(E = z_{\alpha/{2}}\frac{\sigma}{\sqrt{n}}\)
  • The confidence interval is found by adding/subtracting these two values: \(\bar{x} \pm z_{\alpha/{2}}\frac{\sigma}{\sqrt{n}}\)
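
The formula above can be wrapped in a small function. This is a sketch under stated assumptions: `z_interval` is a hypothetical helper name, and the inputs (mean 100, standard deviation 15, sample size 36) are made-up numbers, not values from the slides.

```r
# Hypothetical helper: z-based confidence interval for a mean
z_interval <- function(xbar, sigma, n, conf_level = 0.95) {
  alpha <- 1 - conf_level
  z <- qnorm(1 - alpha / 2)        # critical value z_{alpha/2}
  margin <- z * sigma / sqrt(n)    # margin of error E
  c(lower = xbar - margin, upper = xbar + margin)
}

# Illustrative inputs only
ci <- z_interval(xbar = 100, sigma = 15, n = 36)
ci
```

With these inputs the margin of error is about 4.90, so the interval runs from roughly 95.10 to 104.90.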

Dataset Orange

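Let’s apply this concept to the Orange data set. The confidence interval formula can be used when the sample size is large or the population is normally distributed. Because the population standard deviation \(\sigma\) is unknown here, the sample standard deviation stands in for it.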

data(Orange)
head(Orange)
##   Tree  age circumference
## 1    1  118            30
## 2    1  484            58
## 3    1  664            87
## 4    1 1004           115
## 5    1 1231           120
## 6    1 1372           142

Plotly

The following plot shows the circumference variable against the age of the trees, grouped by the Tree factor, along with the frequency.

Point Estimate & Margin of Error - Latex Slide 2

pointest <- mean(Orange$circumference)
pointest
## [1] 115.8571
z_score <- qnorm(p = 0.025, lower.tail = FALSE) 
ME <- z_score * (sd(Orange$circumference)/
                   sqrt(length(Orange$circumference)))
ME          
## [1] 19.04551

\(E = z_{\alpha/2}\frac{\sigma}{\sqrt{n}} = z_{0.05/2}\frac{\text{standard deviation}}{\sqrt{\text{sample size}}} \approx 19.05\)

Point Estimate & Margin of Error - Latex Slide 3 - Explanation through R Statistical Methods

  • The point estimate used for this dataset is the sample mean of the circumference variable.

  • The margin of error was calculated by first computing the z-score: \(z_{\alpha/{2}}\)

  • Here \(\alpha = 0.05\) is the total area left outside a 95% confidence interval, split evenly between the two tails, so each tail holds \(\alpha/2 = 0.025\). Calling qnorm with p = 0.025 and lower.tail = FALSE returns the value that has 2.5% of the area to its right, which is about 1.96.
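
The critical value can be checked directly. The short sketch below uses only base R's `qnorm`; the variable names are illustrative. It shows that asking for the upper 2.5% tail is the same as asking for the 97.5th percentile.

```r
alpha <- 0.05                                     # 1 - confidence level

# Upper-tail form: value with alpha/2 of the area to its right
z_upper <- qnorm(alpha / 2, lower.tail = FALSE)

# Equivalent lower-tail form: the (1 - alpha/2) quantile
z_quantile <- qnorm(1 - alpha / 2)

all.equal(z_upper, z_quantile)                    # both forms agree
round(z_upper, 2)                                 # about 1.96
```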

When the sample size is not given, we can solve the margin-of-error formula for it using the z-score, standard deviation, and margin of error:

\(n = \frac{(z_{\alpha/{2}})^2\sigma^2}{E^2}\)
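
As a sketch of how this formula might be used, the hypothetical helper `required_n` below computes the sample size for a target margin of error and rounds up, since a sample size must be a whole number. The standard deviation of 15 and target margin of 5 are made-up inputs.

```r
# Hypothetical helper: smallest n that achieves margin of error E
required_n <- function(sigma, E, conf_level = 0.95) {
  z <- qnorm(1 - (1 - conf_level) / 2)   # critical value z_{alpha/2}
  ceiling((z^2 * sigma^2) / E^2)         # round up to a whole observation
}

# Illustrative inputs only
n_needed <- required_n(sigma = 15, E = 5)
n_needed
```

Here the formula gives about 34.6, so 35 observations would be needed.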

Confidence Interval - Contd.

Now we can calculate the endpoints of the confidence interval:

P1 <- pointest - ME
P2 <- pointest + ME


confidence_interval <- c(P1, P2)
confidence_interval
## [1]  96.81163 134.90265

We can be 95% confident that the true population mean circumference of orange trees lies between 96.81 and 134.90 mm.

GGPlot Slide 1

Let’s plot the frequency distribution of the circumference variable.
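
The code for this slide's figure is not shown, so the following is only a minimal sketch of how such a histogram might be drawn with ggplot2. The object name `hist_plot`, the bin count of 10, and the titles are assumptions.

```r
library(ggplot2)

# Histogram of trunk circumference; bins = 10 is an arbitrary choice
hist_plot <- ggplot(Orange, aes(x = circumference)) +
  geom_histogram(bins = 10, color = "white") +
  ggtitle("Distribution of circumference") +
  xlab("circumference") + ylab("frequency")

hist_plot
```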

Introduction to QQ Plot

The data do not appear to be normally distributed. To compare the dataset with a normal distribution, we can use a QQ plot, which plots the quantiles of the dataset against the theoretical quantiles of a normal distribution.

How far the Orange points deviate from the QQ line shows how well the data fit a normal distribution: the closer the points lie to the line, the closer the data are to normal.

GGPlot 2 - QQPlot

library(ggplot2)

# Compare sample quantiles of circumference with normal theoretical quantiles
ggplot(Orange, aes(sample = circumference)) +
  stat_qq() + stat_qq_line(color = "red") + ggtitle("QQ Plot") +
  xlab("Theoretical quantiles") + ylab("Sample quantiles")