## Warning: package 'haven' was built under R version 4.2.3
## Warning: package 'ggplot2' was built under R version 4.2.3
## Warning: package 'rstatix' was built under R version 4.2.3
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
## Warning: package 'tidyverse' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'tidyr' was built under R version 4.2.3
## Warning: package 'readr' was built under R version 4.2.3
## Warning: package 'purrr' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## Warning: package 'stringr' was built under R version 4.2.3
## Warning: package 'forcats' was built under R version 4.2.3
## Warning: package 'lubridate' was built under R version 4.2.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ lubridate 1.9.2 ✔ tibble 3.2.1
## ✔ purrr 1.0.1 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks rstatix::filter(), stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
## Warning: package 'skimr' was built under R version 4.2.3
## Warning: package 'descr' was built under R version 4.2.3
You can calculate confidence intervals for a single variable in a variety of ways. We will look at the following two approaches:
First thing to remember is that with sample sizes of >= 30, you can use the Gaussian distribution for your CI. With smaller sample sizes (< 30), you need to move to a Student’s t distribution.
We will use the built-in R dataset of mtcars
for these
examples. We review summary stats to understand general information
about the data including the number of cases. From the
skimr
package, we see that we have 32 total observations.
This means it is okay to use the Gaussian distribution however since it
is close to 30 we will also look at the Student’s t CI. We will use the
mpg
variable from the mtcars
dataset to
examine the average miles per gallon of the 32 different carss in the
dataset.
summary(mtcars) #Provides summary stats for all variables in a dataset
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
skim(mtcars) #Provides more summary stats including observations for each variable
Name | mtcars |
Number of rows | 32 |
Number of columns | 11 |
_______________________ | |
Column type frequency: | |
numeric | 11 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
mpg | 0 | 1 | 20.09 | 6.03 | 10.40 | 15.43 | 19.20 | 22.80 | 33.90 | ▃▇▅▁▂ |
cyl | 0 | 1 | 6.19 | 1.79 | 4.00 | 4.00 | 6.00 | 8.00 | 8.00 | ▆▁▃▁▇ |
disp | 0 | 1 | 230.72 | 123.94 | 71.10 | 120.83 | 196.30 | 326.00 | 472.00 | ▇▃▃▃▂ |
hp | 0 | 1 | 146.69 | 68.56 | 52.00 | 96.50 | 123.00 | 180.00 | 335.00 | ▇▇▆▃▁ |
drat | 0 | 1 | 3.60 | 0.53 | 2.76 | 3.08 | 3.70 | 3.92 | 4.93 | ▇▃▇▅▁ |
wt | 0 | 1 | 3.22 | 0.98 | 1.51 | 2.58 | 3.33 | 3.61 | 5.42 | ▃▃▇▁▂ |
qsec | 0 | 1 | 17.85 | 1.79 | 14.50 | 16.89 | 17.71 | 18.90 | 22.90 | ▃▇▇▂▁ |
vs | 0 | 1 | 0.44 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▆ |
am | 0 | 1 | 0.41 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▆ |
gear | 0 | 1 | 3.69 | 0.74 | 3.00 | 3.00 | 4.00 | 4.00 | 5.00 | ▇▁▆▁▂ |
carb | 0 | 1 | 2.81 | 1.62 | 1.00 | 2.00 | 2.00 | 4.00 | 8.00 | ▇▂▅▁▁ |
First, let’s calculate the confidence interval using the large sample
size (>= 30) approach and tidyverse
.
cv <-1.96 #Set the critical value based on the confidence level you want. Here we use 1.96 for a confidence level of 95%
gaussian_ci<-mtcars %>%
summarise(mean = mean(mpg),
sd = sd(mpg),
n=n(),
se = sd / sqrt(n),
ub = mean+(cv*se),
lb = mean-(cv*se),
range=ub-lb)
print(gaussian_ci)
## mean sd n se ub lb range
## 1 20.09062 6.026948 32 1.065424 22.17886 18.00239 4.176462
mtcars %>%
get_summary_stats(mpg, type = "common")
## # A tibble: 1 × 10
## variable n min max median iqr mean sd se ci
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 mpg 32 10.4 33.9 19.2 7.38 20.1 6.03 1.06 2.17
In the above code, we used the formula reviewed in class to calculate
the standard error and, along with the pre-specified critical value for
a 95% confidence level, the confidence interval. We see that the mean
MPG
in the dataset (from 1974 so old data but principles
still apply) is 20.09 with a standard deviation of 6.03. The SE,
calculated as the SD / square root of the number of observations is
equal to 1.065. To create the confidence level we simply add and
subtract the sum of the critical value and SE to the mean, which gives
us a 95% confidence interval with a lower bound of 18 mpg
and upper bound of 22.2 mpg
.
Calculating the Student’s t confidence interval with tidyverse is a bit more difficult since the calculation of the critical value is not as straight foward as in the above approach. There are two differences in calculating the CI using the Student’s t distribution. First, is the use of degrees of freedom rather than the n-count and second is in how the critical value is calculated. You first create the alpha level you want to use - level of significance - and we use an alpha of .05 for a 95% confidence interval.
Once the Student’s t critical value is calculated, you simply
multiply the standard error - which is calculated in the same way as
previously - by the adjusted critical value. At small sample sizes, this
adjustment will be much larger but as the sample size increases it will
converge on the same value as the Gaussian approach above. We will also
use the t.test
function from the rstatix
package to check that the CI is calculated correctly.
alpha<-.05 #Significance level you want to use in the CI calculation
students_ci<-mtcars %>%
summarise(mean = mean(mpg),
sd = sd(mpg),
n=n(),
df=n-1, #Instead of N, you use the degrees of freedom in the Student's T calculation
t_cv=qt(1-alpha/2, df), #And the Critical Value is slightly different here as well
se = sd / sqrt(n),
ub = mean+(t_cv*se),
lb = mean-(t_cv*se),
range=ub-lb)
print(students_ci)
## mean sd n df t_cv se ub lb range
## 1 20.09062 6.026948 32 31 2.039513 1.065424 22.26357 17.91768 4.345893
t.test(mtcars$mpg, mu = mean(mtcars$mpg), conf.level = 0.95) #This code calculates the 95% CI for the specified variable. Note it is technically code for a one-sample t-test but for creating the CI it doesn't matter.
##
## One Sample t-test
##
## data: mtcars$mpg
## t = 0, df = 31, p-value = 1
## alternative hypothesis: true mean is not equal to 20.09062
## 95 percent confidence interval:
## 17.91768 22.26357
## sample estimates:
## mean of x
## 20.09062
With the above code, we calculated the 95% confidence interval using
the Student’s t distribution AND validated that the approach is accurate
using the t.test
function. The results are identical
meaning we calculated the CI correctly using the tidyverse
approach.
Now, let’s compare the CIs from the two different distributions.
Since we already calculated and saved the CIs in the above code, here we
just need to list
the results.
list(
gaussian_ci, students_ci
)
## [[1]]
## mean sd n se ub lb range
## 1 20.09062 6.026948 32 1.065424 22.17886 18.00239 4.176462
##
## [[2]]
## mean sd n df t_cv se ub lb range
## 1 20.09062 6.026948 32 31 2.039513 1.065424 22.26357 17.91768 4.345893
Note that the confidence interval is slightly larger with the Student’s t distribution as expected with a small sample size of 32 observations. As that number grows, the CIs will converge on each other. The only reason there is a difference between the two CIs is because of the different way the critical value is calculated between the two distributions. With a sample size of 32, either approach is fine to use; although, the use of the Student’s t distribution is generally used with sample sizes of < 30.
We finish this review by calculating the z-scores for each individual car and save it in a new dataset. Since To calculate the z-score, we need three pieces of information:
new<-mtcars %>% #This calculates the z-score while keeping all of the original variables in the df
mutate(mean = mean(mpg),
n=n(),
var=var(mpg),
sd=sd(mpg),
zscore = (mpg - mean)/sd)
new2 <- mtcars %>% #This just selects for easier printing the variables we are interested in
rownames_to_column(var = "car_name") %>% # Add car names as a column
mutate(mpg_org=mpg,
mean = mean(mpg),
zscore = (mpg - mean) / sd(mpg)) %>%
select(car_name, mpg_org, zscore) # Select only car_name, original mpg, and zscore columns
print(new2)
## car_name mpg_org zscore
## 1 Mazda RX4 21.0 0.15088482
## 2 Mazda RX4 Wag 21.0 0.15088482
## 3 Datsun 710 22.8 0.44954345
## 4 Hornet 4 Drive 21.4 0.21725341
## 5 Hornet Sportabout 18.7 -0.23073453
## 6 Valiant 18.1 -0.33028740
## 7 Duster 360 14.3 -0.96078893
## 8 Merc 240D 24.4 0.71501778
## 9 Merc 230 22.8 0.44954345
## 10 Merc 280 19.2 -0.14777380
## 11 Merc 280C 17.8 -0.38006384
## 12 Merc 450SE 16.4 -0.61235388
## 13 Merc 450SL 17.3 -0.46302456
## 14 Merc 450SLC 15.2 -0.81145962
## 15 Cadillac Fleetwood 10.4 -1.60788262
## 16 Lincoln Continental 10.4 -1.60788262
## 17 Chrysler Imperial 14.7 -0.89442035
## 18 Fiat 128 32.4 2.04238943
## 19 Honda Civic 30.4 1.71054652
## 20 Toyota Corolla 33.9 2.29127162
## 21 Toyota Corona 21.5 0.23384555
## 22 Dodge Challenger 15.5 -0.76168319
## 23 AMC Javelin 15.2 -0.81145962
## 24 Camaro Z28 13.3 -1.12671039
## 25 Pontiac Firebird 19.2 -0.14777380
## 26 Fiat X1-9 27.3 1.19619000
## 27 Porsche 914-2 26.0 0.98049211
## 28 Lotus Europa 30.4 1.71054652
## 29 Ford Pantera L 15.8 -0.71190675
## 30 Ferrari Dino 19.7 -0.06481307
## 31 Maserati Bora 15.0 -0.84464392
## 32 Volvo 142E 21.4 0.21725341
We can use the z-score to easily assess which values are the most “extreme” compared to the mean of the variable along with the amount of spread around that point-estimate. Since we have a relatively small sample size, there is typically fewer outliers in the data, which is what we see here. There are two large z-scores in the sample for the Fiat 128 - MPG = 32.4 and Z-score = 2.04 - plus the Toyota Corolla - MPG = 33.9 and Z-score = 2.29.
As you get more experienced working with data, the fact that there are no z-scores <-1.96 indicates a skew to the data with a long right-sided tail. In a few weeks, we will consider approaches to dealing with this type of skew.
Let’s validate this by creating a histogram of the mpg
variable using the ggplot
package. In this code, make sure
to use the appropriate dataframe and variable name in the first line of
code. Then you might need to change the number of bins
used
depending on the total number of observations.
Do not delete the ..
portion of the
geom_histogram
function as that is necessary to create the
histogram. And always make sure that you update the title and axes
names.
#Creates Histogram using GGplot
ggplot(mtcars, aes(x = mpg)) + #Set the data to use PLUS the variable to graph
geom_histogram(aes(y = ..density..), position = "identity",
bins = 8, alpha = 0.2, color="black") + #Might have to change the bin counts for appropriate visualization
labs(title = "Histogram of MPG from the mtcars Data ",
x = "Miles per Gallon",
y = "Density") +
theme_minimal()
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
For a discrete variable histogram - as compared to the continuous
variable above - you simply need to use the freg
function
from the descr
package. Here we use the number of cylinders
variable, cyl
to illustrate this point.
freq(mtcars$cyl)
## mtcars$cyl
## Frequency Percent
## 4 11 34.38
## 6 7 21.88
## 8 14 43.75
## Total 32 100.00