2024-09-22

Slide 2: Confidence Intervals

A confidence interval is a statistical tool that gives a lower and an upper limit for an unknown population parameter, computed from a sample. If we drew many samples from the same population and built an interval from each one, a fixed percentage of those intervals would contain the true value of the parameter. A 95% confidence interval means that about 95% of the intervals built this way would contain the true value for the whole population.
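The repeated-sampling idea can be checked with a small simulation. This is only a sketch on a made-up normal population (the mean, standard deviation, sample size and seed below are assumptions, not values from the iris data): it builds a 95% interval from each of many samples and counts how often the interval contains the true mean.

```r
# Simulated illustration (hypothetical population, not the iris data)
set.seed(42)        # arbitrary seed, for reproducibility
true_mean <- 10     # assumed population mean
true_sd   <- 2      # assumed (known) population standard deviation
n_obs     <- 50     # observations per sample
n_samples <- 2000   # number of repeated samples

covers <- replicate(n_samples, {
  x  <- rnorm(n_obs, mean = true_mean, sd = true_sd)
  me <- qnorm(0.975) * true_sd / sqrt(n_obs)   # z-based margin of error
  (mean(x) - me <= true_mean) && (true_mean <= mean(x) + me)
})

# Proportion of intervals that contain the true mean
coverage <- mean(covers)
coverage
```

The proportion should come out close to 0.95, which is what the "95%" in the name refers to.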

Slide 3: Confidence Intervals with Z-Scores

A confidence interval can be calculated using z-scores or t-scores. T-scores are used when the population’s standard deviation is unknown, while z-scores are used when it is known or, as a common approximation, when the sample is large. In both cases the data should follow a normal distribution, which can be recognized by a symmetric “bell curve” shape that is not skewed to the right or left.
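To see how the two approaches differ in practice, the critical values can be compared directly. The sketch below uses base R's `qnorm()` and `qt()`; the 49 degrees of freedom are an assumed example corresponding to a sample of 50 observations.

```r
# Critical values for a two-sided 95% interval
conf_level <- 0.95
alpha <- 1 - conf_level

z_crit <- qnorm(1 - alpha / 2)             # standard normal quantile
t_crit <- qt(1 - alpha / 2, df = 50 - 1)   # t quantile; df = 49 assumes n = 50

z_crit
t_crit
```

The t critical value is slightly larger than the z critical value, so a t-based interval is a little wider. The gap shrinks as the sample size grows.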

For this presentation I will use the iris data set, which is built into R.

data("iris")
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
library(tidyverse)
library(ggplot2)
library(dplyr)
library(plotly)

Slide 4: Confidence Intervals with Iris Data Set

For the purpose of this presentation I am going to look specifically at the petal lengths of the iris flowers. As a first step I will examine the petal lengths by species. There are 3 species in the data set.

Plot
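The code behind this plot is not shown on the slide. A minimal sketch of one way to draw it, assuming a ggplot2 boxplot of petal length per species (the plot type, labels and the name `p_species` are assumptions):

```r
library(ggplot2)

# Assumed plot type: one box per species showing the spread of petal length
p_species <- ggplot(iris, aes(x = Species, y = Petal.Length, fill = Species)) +
  geom_boxplot() +
  labs(title = "Iris Petal Length by Species",
       x = "Species", y = "Petal Length")
p_species
```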

Slide 5: Confidence Intervals with Iris Data Set

To see more clearly how petal length is distributed for each species, I now use a histogram. Here I can see that the setosa species follows a unimodal distribution resembling the normal bell curve. The other two species look bimodal in this histogram and would need further treatment before a normal-based interval is appropriate. I will continue the presentation focusing on the setosa species of the iris flower.

Plot
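The histogram code is also not shown. One possible version, assuming one panel per species and a bin width of 0.1 (both are illustrative choices, as is the name `p_hist`):

```r
library(ggplot2)

# Assumed layout: one histogram panel per species, stacked vertically.
# binwidth = 0.1 is an assumed value chosen to match the data's precision.
p_hist <- ggplot(iris, aes(x = Petal.Length, fill = Species)) +
  geom_histogram(binwidth = 0.1, colour = "white") +
  facet_wrap(~ Species, ncol = 1, scales = "free_y") +
  labs(title = "Distribution of Petal Length by Species",
       x = "Petal Length", y = "Count")
p_hist
```

Setting `binwidth` explicitly replaces ggplot2's default of 30 bins, so the bins are chosen deliberately rather than by default.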

Slide 6: Filtering the Data from Iris Data Set

The code below filters the iris data set and extracts the petal length values for the setosa species only.

setosaLenghts <- select(.data=iris, Petal.Length, Species) %>%
  filter(Species == "setosa")

setosaLenghts
   Petal.Length Species
1           1.4  setosa
2           1.4  setosa
3           1.3  setosa
4           1.5  setosa
5           1.4  setosa
6           1.7  setosa
7           1.4  setosa
8           1.5  setosa
9           1.4  setosa
10          1.5  setosa
11          1.5  setosa
12          1.6  setosa
13          1.4  setosa
14          1.1  setosa
15          1.2  setosa
16          1.5  setosa
17          1.3  setosa
18          1.4  setosa
19          1.7  setosa
20          1.5  setosa
21          1.7  setosa
22          1.5  setosa
23          1.0  setosa
24          1.7  setosa
25          1.9  setosa
26          1.6  setosa
27          1.6  setosa
28          1.5  setosa
29          1.4  setosa
30          1.6  setosa
31          1.6  setosa
32          1.5  setosa
33          1.5  setosa
34          1.4  setosa
35          1.5  setosa
36          1.2  setosa
37          1.3  setosa
38          1.4  setosa
39          1.3  setosa
40          1.5  setosa
41          1.3  setosa
42          1.3  setosa
43          1.3  setosa
44          1.6  setosa
45          1.9  setosa
46          1.4  setosa
47          1.6  setosa
48          1.4  setosa
49          1.5  setosa
50          1.4  setosa

Slide 7: Confidence Interval Calculation

In order to calculate a confidence interval using z-scores, several parameters need to be obtained.

\[\bar{X} = \text{sample mean}\]
\[z = \text{critical value for the chosen confidence level (standard normal)}\]
\[sd = \text{sample standard deviation}\]
\[n = \text{sample size}\]

Slide 8: Mean and Standard Error Calculation

To get the mean petal length:

meanPetalLength_Setosa <- mean(setosaLenghts$Petal.Length)
meanPetalLength_Setosa
[1] 1.462

To get the sample size:

sample_size_n <- length(setosaLenghts$Petal.Length)
sample_size_n
[1] 50

Slide 8: Mean and Standard Error Calculation (Cont.)

To get standard deviation of the sample:

sd_SetosaPL <- sd(setosaLenghts$Petal.Length)
sd_SetosaPL
[1] 0.173664

To get the standard error of the sample:

se_SetosaPL <- sd_SetosaPL / sqrt(sample_size_n)
se_SetosaPL
[1] 0.0245598
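The same four quantities can also be computed in one step. The sketch below is an alternative, not the approach used on the slides: it uses dplyr's `group_by()` and `summarise()` to get the mean, sample size, standard deviation and standard error of petal length for every species at once. The name `species_summary` and its column names are assumptions.

```r
library(dplyr)

# One row per species: mean, sample size, sd and standard error of petal length
species_summary <- iris %>%
  group_by(Species) %>%
  summarise(
    mean_pl = mean(Petal.Length),
    n       = n(),
    sd_pl   = sd(Petal.Length),
    se_pl   = sd_pl / sqrt(n)    # standard error = sd / sqrt(n)
  )
species_summary
```

The setosa row should reproduce the values computed step by step above.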

Slide 9: Margin of Error Calculation

After the values have been obtained from the data set, they are ready to be put into the confidence interval formula for z-scores.

\[CI = \bar{X} \pm (z \times \frac{sd}{\sqrt{n}})\]
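Substituting the setosa values computed on the previous slides (mean 1.462, standard deviation 0.173664, sample size 50) and z = 1.96 for a 95% interval, the formula works out as:

\[CI = 1.462 \pm (1.96 \times \frac{0.173664}{\sqrt{50}}) = 1.462 \pm (1.96 \times 0.0245598) \approx 1.462 \pm 0.048137\]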

# 1.96 is the standard z-score for a 95% confidence interval
marginOfError_95CI <- 1.96 * se_SetosaPL

# Calculating the lower bound of the conf interval
lower_bound <- meanPetalLength_Setosa - marginOfError_95CI

Slide 9: Margin of Error Calculation (Cont.)

# Calculating the upper bound of the conf interval
upper_bound <- meanPetalLength_Setosa + marginOfError_95CI

lower_bound
[1] 1.413863
upper_bound
[1] 1.510137

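This means we can be 95% confident that the true mean petal length of the Iris setosa population lies between 1.413863 and 1.510137. The interval describes the mean, not individual flowers, so it does not say that 95% of the flowers have a petal length in this range. According to the R documentation for the iris data set, the measurements are in centimeters.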
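Since the standard deviation used here comes from the sample, it may be worth comparing the z-based interval with a t-based one. The sketch below wraps the calculation in a hypothetical helper, `mean_ci()` (not part of the original slides), and checks the t version against base R's `t.test()`. Note that it uses `qnorm(0.975)` rather than the rounded 1.96, so the z bounds can differ from the values above in the last decimal place.

```r
# Hypothetical helper (not from the slides): confidence interval for a mean,
# using either the normal (z) or the t critical value
mean_ci <- function(x, conf_level = 0.95, method = c("z", "t")) {
  method <- match.arg(method)
  n    <- length(x)
  se   <- sd(x) / sqrt(n)                  # standard error of the mean
  p    <- 1 - (1 - conf_level) / 2         # upper-tail quantile, e.g. 0.975
  crit <- if (method == "z") qnorm(p) else qt(p, df = n - 1)
  c(lower = mean(x) - crit * se, upper = mean(x) + crit * se)
}

setosa_pl <- iris$Petal.Length[iris$Species == "setosa"]

ci_z <- mean_ci(setosa_pl, method = "z")
ci_t <- mean_ci(setosa_pl, method = "t")

# Base R's one-sample t-test reports the same t-based interval
ci_builtin <- t.test(setosa_pl, conf.level = 0.95)$conf.int

ci_z
ci_t
ci_builtin
```

With 50 observations the two intervals are very close; the t-based one is slightly wider.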

Slide 10: Setosa Iris Petal Length Distribution vs its Width

Graph:

Slide 10: Code for the Plotly Plot

# Assumption: setosaPetalWidth is not defined on the earlier slides;
# it is built here the same way as setosaLenghts
setosaPetalWidth <- select(.data = iris, Petal.Width, Species) %>%
  filter(Species == "setosa")

# Scatter plot of petal length against petal width for setosa
SetLvsW <- plot_ly(x = setosaLenghts$Petal.Length,
                   y = setosaPetalWidth$Petal.Width,
                   type = 'scatter', mode = 'markers') %>%
  layout(title = 'Setosa Iris Petal Lengths vs Petal Width',
         xaxis = list(title = 'Petal Length'),
         yaxis = list(title = 'Petal Width'))

# Vertical dashed line at the lower bound of the 95% CI
SetLvsW <- SetLvsW %>%
  add_segments(x = lower_bound, xend = lower_bound,
               y = min(setosaPetalWidth$Petal.Width),
               yend = max(setosaPetalWidth$Petal.Width),
               line = list(color = "blue", width = 3, dash = 'dash'),
               name = "Lower Bound 95%CI")

# Vertical dashed line at the upper bound of the 95% CI
SetLvsW <- SetLvsW %>%
  add_segments(x = upper_bound, xend = upper_bound,
               y = min(setosaPetalWidth$Petal.Width),
               yend = max(setosaPetalWidth$Petal.Width),
               line = list(color = "green", width = 3, dash = 'dash'),
               name = "Upper Bound 95%CI")

# Vertical solid line at the sample mean
SetLvsW <- SetLvsW %>%
  add_segments(x = meanPetalLength_Setosa, xend = meanPetalLength_Setosa,
               y = min(setosaPetalWidth$Petal.Width),
               yend = max(setosaPetalWidth$Petal.Width),
               line = list(color = "red", width = 3),
               name = "Sample Length Mean")
SetLvsW