## Warning: package 'haven' was built under R version 4.2.3
## Warning: package 'ggplot2' was built under R version 4.2.3
## Warning: package 'rstatix' was built under R version 4.2.3
## 
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
## 
##     filter
## Warning: package 'tidyverse' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'tidyr' was built under R version 4.2.3
## Warning: package 'readr' was built under R version 4.2.3
## Warning: package 'purrr' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## Warning: package 'stringr' was built under R version 4.2.3
## Warning: package 'forcats' was built under R version 4.2.3
## Warning: package 'lubridate' was built under R version 4.2.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.1     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks rstatix::filter(), stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
## Warning: package 'skimr' was built under R version 4.2.3
## Warning: package 'descr' was built under R version 4.2.3

Confidence Intervals

You can calculate confidence intervals for a single variable in a variety of ways. We will look at the following two approaches:

  1. Using Tidyverse to Manually Calculate CIs
  2. Use R function to calculate CIs using Student’s T distribution (note there is no similar approach in R to calculate the CI for a larger sample size distribution)

First thing to remember is that with sample sizes of >= 30, you can use the Gaussian distribution for your CI. With smaller sample sizes (< 30), you need to move to a Student’s t distribution.

We will use the built-in R dataset of mtcars for these examples. We review summary stats to understand general information about the data including the number of cases. From the skimr package, we see that we have 32 total observations. This means it is okay to use the Gaussian distribution however since it is close to 30 we will also look at the Student’s t CI. We will use the mpg variable from the mtcars dataset to examine the average miles per gallon of the 32 different carss in the dataset.

summary(mtcars) #Provides summary stats for all variables in a dataset 
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
skim(mtcars) #Provides more summary stats including observations for each variable 
Data summary
Name mtcars
Number of rows 32
Number of columns 11
_______________________
Column type frequency:
numeric 11
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
mpg 0 1 20.09 6.03 10.40 15.43 19.20 22.80 33.90 ▃▇▅▁▂
cyl 0 1 6.19 1.79 4.00 4.00 6.00 8.00 8.00 ▆▁▃▁▇
disp 0 1 230.72 123.94 71.10 120.83 196.30 326.00 472.00 ▇▃▃▃▂
hp 0 1 146.69 68.56 52.00 96.50 123.00 180.00 335.00 ▇▇▆▃▁
drat 0 1 3.60 0.53 2.76 3.08 3.70 3.92 4.93 ▇▃▇▅▁
wt 0 1 3.22 0.98 1.51 2.58 3.33 3.61 5.42 ▃▃▇▁▂
qsec 0 1 17.85 1.79 14.50 16.89 17.71 18.90 22.90 ▃▇▇▂▁
vs 0 1 0.44 0.50 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▆
am 0 1 0.41 0.50 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▆
gear 0 1 3.69 0.74 3.00 3.00 4.00 4.00 5.00 ▇▁▆▁▂
carb 0 1 2.81 1.62 1.00 2.00 2.00 4.00 8.00 ▇▂▅▁▁

Gaussian Distribution CIs - Tidyverse

First, let’s calculate the confidence interval using the large sample size (>= 30) approach and tidyverse.

cv <-1.96 #Set the critical value based on the confidence level you want. Here we use 1.96 for a confidence level of 95% 

gaussian_ci<-mtcars %>% 
  summarise(mean = mean(mpg),
            sd = sd(mpg), 
            n=n(), 
            se = sd / sqrt(n),
            ub = mean+(cv*se), 
            lb = mean-(cv*se),
            range=ub-lb)

print(gaussian_ci)
##       mean       sd  n       se       ub       lb    range
## 1 20.09062 6.026948 32 1.065424 22.17886 18.00239 4.176462
mtcars %>%
  get_summary_stats(mpg, type = "common") 
## # A tibble: 1 × 10
##   variable     n   min   max median   iqr  mean    sd    se    ci
##   <fct>    <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 mpg         32  10.4  33.9   19.2  7.38  20.1  6.03  1.06  2.17

In the above code, we used the formula reviewed in class to calculate the standard error and, along with the pre-specified critical value for a 95% confidence level, the confidence interval. We see that the mean MPG in the dataset (from 1974 so old data but principles still apply) is 20.09 with a standard deviation of 6.03. The SE, calculated as the SD / square root of the number of observations is equal to 1.065. To create the confidence level we simply add and subtract the sum of the critical value and SE to the mean, which gives us a 95% confidence interval with a lower bound of 18 mpg and upper bound of 22.2 mpg.

Student’s T CIs

Calculating the Student’s t confidence interval with tidyverse is a bit more difficult since the calculation of the critical value is not as straight foward as in the above approach. There are two differences in calculating the CI using the Student’s t distribution. First, is the use of degrees of freedom rather than the n-count and second is in how the critical value is calculated. You first create the alpha level you want to use - level of significance - and we use an alpha of .05 for a 95% confidence interval.

Once the Student’s t critical value is calculated, you simply multiply the standard error - which is calculated in the same way as previously - by the adjusted critical value. At small sample sizes, this adjustment will be much larger but as the sample size increases it will converge on the same value as the Gaussian approach above. We will also use the t.test function from the rstatix package to check that the CI is calculated correctly.

alpha<-.05 #Significance level you want to use in the CI calculation 

students_ci<-mtcars %>% 
  summarise(mean = mean(mpg),
            sd = sd(mpg), 
            n=n(),
            df=n-1, #Instead of N, you use the degrees of freedom in the Student's T calculation
            t_cv=qt(1-alpha/2, df), #And the Critical Value is slightly different here as well
            se = sd / sqrt(n),
            ub = mean+(t_cv*se), 
            lb = mean-(t_cv*se),
            range=ub-lb)

print(students_ci)
##       mean       sd  n df     t_cv       se       ub       lb    range
## 1 20.09062 6.026948 32 31 2.039513 1.065424 22.26357 17.91768 4.345893
t.test(mtcars$mpg, mu = mean(mtcars$mpg),  conf.level = 0.95) #This code calculates the 95% CI for the specified variable. Note it is technically code for a one-sample t-test but for creating the CI it doesn't matter.  
## 
##  One Sample t-test
## 
## data:  mtcars$mpg
## t = 0, df = 31, p-value = 1
## alternative hypothesis: true mean is not equal to 20.09062
## 95 percent confidence interval:
##  17.91768 22.26357
## sample estimates:
## mean of x 
##  20.09062

With the above code, we calculated the 95% confidence interval using the Student’s t distribution AND validated that the approach is accurate using the t.test function. The results are identical meaning we calculated the CI correctly using the tidyverse approach.

Now, let’s compare the CIs from the two different distributions. Since we already calculated and saved the CIs in the above code, here we just need to list the results.

list(
gaussian_ci, students_ci
)
## [[1]]
##       mean       sd  n       se       ub       lb    range
## 1 20.09062 6.026948 32 1.065424 22.17886 18.00239 4.176462
## 
## [[2]]
##       mean       sd  n df     t_cv       se       ub       lb    range
## 1 20.09062 6.026948 32 31 2.039513 1.065424 22.26357 17.91768 4.345893

Note that the confidence interval is slightly larger with the Student’s t distribution as expected with a small sample size of 32 observations. As that number grows, the CIs will converge on each other. The only reason there is a difference between the two CIs is because of the different way the critical value is calculated between the two distributions. With a sample size of 32, either approach is fine to use; although, the use of the Student’s t distribution is generally used with sample sizes of < 30.

Z-Scores

We finish this review by calculating the z-scores for each individual car and save it in a new dataset. Since To calculate the z-score, we need three pieces of information:

  1. Sample mean for the variable
  2. Sample standard deviation for the variable
  3. Value for the variable for each unit
new<-mtcars %>% #This calculates the z-score while keeping all of the original variables in the df
  mutate(mean = mean(mpg),
         n=n(), 
         var=var(mpg), 
         sd=sd(mpg), 
         zscore = (mpg - mean)/sd)

new2 <- mtcars %>% #This just selects for easier printing the variables we are interested in 
  rownames_to_column(var = "car_name") %>% # Add car names as a column
  mutate(mpg_org=mpg, 
    mean = mean(mpg),
         zscore = (mpg - mean) / sd(mpg)) %>%
  select(car_name, mpg_org, zscore) # Select only car_name, original mpg, and zscore columns

print(new2)
##               car_name mpg_org      zscore
## 1            Mazda RX4    21.0  0.15088482
## 2        Mazda RX4 Wag    21.0  0.15088482
## 3           Datsun 710    22.8  0.44954345
## 4       Hornet 4 Drive    21.4  0.21725341
## 5    Hornet Sportabout    18.7 -0.23073453
## 6              Valiant    18.1 -0.33028740
## 7           Duster 360    14.3 -0.96078893
## 8            Merc 240D    24.4  0.71501778
## 9             Merc 230    22.8  0.44954345
## 10            Merc 280    19.2 -0.14777380
## 11           Merc 280C    17.8 -0.38006384
## 12          Merc 450SE    16.4 -0.61235388
## 13          Merc 450SL    17.3 -0.46302456
## 14         Merc 450SLC    15.2 -0.81145962
## 15  Cadillac Fleetwood    10.4 -1.60788262
## 16 Lincoln Continental    10.4 -1.60788262
## 17   Chrysler Imperial    14.7 -0.89442035
## 18            Fiat 128    32.4  2.04238943
## 19         Honda Civic    30.4  1.71054652
## 20      Toyota Corolla    33.9  2.29127162
## 21       Toyota Corona    21.5  0.23384555
## 22    Dodge Challenger    15.5 -0.76168319
## 23         AMC Javelin    15.2 -0.81145962
## 24          Camaro Z28    13.3 -1.12671039
## 25    Pontiac Firebird    19.2 -0.14777380
## 26           Fiat X1-9    27.3  1.19619000
## 27       Porsche 914-2    26.0  0.98049211
## 28        Lotus Europa    30.4  1.71054652
## 29      Ford Pantera L    15.8 -0.71190675
## 30        Ferrari Dino    19.7 -0.06481307
## 31       Maserati Bora    15.0 -0.84464392
## 32          Volvo 142E    21.4  0.21725341

We can use the z-score to easily assess which values are the most “extreme” compared to the mean of the variable along with the amount of spread around that point-estimate. Since we have a relatively small sample size, there is typically fewer outliers in the data, which is what we see here. There are two large z-scores in the sample for the Fiat 128 - MPG = 32.4 and Z-score = 2.04 - plus the Toyota Corolla - MPG = 33.9 and Z-score = 2.29.

As you get more experienced working with data, the fact that there are no z-scores <-1.96 indicates a skew to the data with a long right-sided tail. In a few weeks, we will consider approaches to dealing with this type of skew.

Histogram

Let’s validate this by creating a histogram of the mpg variable using the ggplot package. In this code, make sure to use the appropriate dataframe and variable name in the first line of code. Then you might need to change the number of bins used depending on the total number of observations.

Do not delete the .. portion of the geom_histogram function as that is necessary to create the histogram. And always make sure that you update the title and axes names.

#Creates Histogram using GGplot 
ggplot(mtcars, aes(x = mpg)) + #Set the data to use PLUS the variable to graph 
  geom_histogram(aes(y = ..density..), position = "identity", 
                 bins = 8, alpha = 0.2, color="black") + #Might have to change the bin counts for appropriate visualization 
  labs(title = "Histogram of MPG from the mtcars Data ",
       x = "Miles per Gallon",
       y = "Density") +
  theme_minimal()
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

For a discrete variable histogram - as compared to the continuous variable above - you simply need to use the freg function from the descr package. Here we use the number of cylinders variable, cyl to illustrate this point.

freq(mtcars$cyl)

## mtcars$cyl 
##       Frequency Percent
## 4            11   34.38
## 6             7   21.88
## 8            14   43.75
## Total        32  100.00