We should already be familiar with the difference between “inferential statistics” and “predictive statistics”, but in this notebook, we’ll dig deeper into inference itself, and investigate what it looks like in the context of a data set.
In particular, we’ll cover building new columns from existing data, bootstrapping, standard errors and confidence intervals, and measures of covariance and correlation.
Statistical inference consists of two parts: drawing a conclusion (an estimate) about a population from data, and measuring the reliability of that conclusion.
For example, when we calculate the mean of a column of data, you could argue that we are inferring something about the “true” mean of its corresponding population. We can measure the reliability of this inference (or estimate) by investigating aspects of the data which might challenge our conclusion.
For this notebook, let’s use the gapminder dataset.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(gapminder)
library(ggthemes)
options(scipen = 0)
Before we can draw conclusions about a population, let alone evaluate those conclusions, it benefits us to have as much information as we can gather about the population itself. Given some set of data, we do have several columns of data (i.e., observations of different variables), but often there are latent relationships in the data which are only made apparent by building columns based on others. Here, we’ll discuss a few different ways to build on the columns we have.
Note:
- the ways you can create new columns are limited only by your imagination! Here, we only illustrate a few common examples.
- of course, these methods can be applied to all columns, including ones that have been created! So, you can see how the process can build on itself.
One way to gather a different “view” of our data is to compare each value in a column with a single value. That is, a deviation column is a subtraction operation: the difference between a row value and a constant value, illustrating its deviation from the predefined constant value. One common example of this is the difference between a row value and its corresponding mean within a group.
Before mutating a data frame, save the “raw” version. Note: R makes a copy of an object when you assign it to a different variable name. (E.g., see the Data in the environment of this notebook after running all the cells.)
# let's start by saving the original dataset as "raw"
gapminder_raw <- gapminder_unfiltered
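As a tiny, hypothetical illustration of the copy behavior mentioned in the note above: modifying the object under its new name leaves the original values untouched.
# hypothetical example of R's copy-on-modify behavior
x <- c(1, 2, 3)
y <- x      # y starts out identical to x
y[1] <- 99  # modifying y triggers a copy; x is not affected
x           # still 1 2 3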
Now, let’s start by cleaning up the names of this data frame, then simply capturing the difference between each row’s year and the current year. That is, we’re creating a column representing the number of years since the year recorded for that row.
# we choose to overwrite the "filtered" gm data
# uncomment this to see column names
# colnames(gapminder_unfiltered)
gapminder <- gapminder_unfiltered |>
# rename columns
rename(life_exp = "lifeExp",
population = "pop",
gdp_per_cap = "gdpPercap") |>
# get deviation column
mutate(years_since = year(now()) - year)
gapminder |>
select(years_since) |>
summary()
## years_since
## Min. :17.00
## 1st Qu.:28.00
## Median :42.00
## Mean :43.71
## 3rd Qu.:57.00
## Max. :74.00
We can also calculate a deviation column that is dependent on other columns. In this case, suppose we are interested in how a country’s life expectancy (for some year) measures up to the average life expectancy for similar countries (in the same continent) during the same year.
We’ve seen the group_by-summarize
combination, but here we use a group_by-mutate
combination, to create columns (based on groups). The syntax is
similar.
# it's best to test things out before setting variables
gapminder <- gapminder |>
group_by(continent, year) |>
mutate(life_exp_dev = life_exp - mean(life_exp), # deviation
life_exp_avg = mean(life_exp)) |> # group average
arrange(continent, year)
gapminder |>
select(continent, year, country, life_exp_avg,
life_exp, life_exp_dev)
## # A tibble: 3,313 × 6
## # Groups: continent, year [286]
## continent year country life_exp_avg life_exp life_exp_dev
## <fct> <int> <fct> <dbl> <dbl> <dbl>
## 1 Africa 1950 Libya 41.4 42.7 1.36
## 2 Africa 1950 Uganda 41.4 40 -1.36
## 3 Africa 1952 Algeria 39.3 43.1 3.77
## 4 Africa 1952 Angola 39.3 30.0 -9.30
## 5 Africa 1952 Benin 39.3 38.2 -1.09
## 6 Africa 1952 Botswana 39.3 47.6 8.31
## 7 Africa 1952 Burkina Faso 39.3 32.0 -7.34
## 8 Africa 1952 Burundi 39.3 39.0 -0.281
## 9 Africa 1952 Cameroon 39.3 38.5 -0.789
## 10 Africa 1952 Cape Verde 39.3 48.5 9.17
## # ℹ 3,303 more rows
You’ll notice at the top of the output that this data frame is
“grouped” by continent and year. After adding
the new column, we don’t really need the groups anymore, so we can use
ungroup to un-group the data.
gapminder <- ungroup(gapminder)
This method of performing some calculation (e.g., subtraction) within groups can be applied to any expression! As a simple example, let’s assume that this dataset correctly lists all the countries in each continent. Create a perc_of_continent column which represents the proportion of its continent’s population that each country makes up.
print("your code here")
## [1] "your code here"
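One possible approach (a sketch, not the only solution): group by continent and year, then divide each country's population by the group total.
# a sketch: each country's share of its continent's population that year
gapminder |>
  group_by(continent, year) |>
  mutate(perc_of_continent = population / sum(population)) |>
  ungroup() |>
  select(continent, year, country, population, perc_of_continent)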
Another calculation we can make is rank ordering continuous values.
So, for each year, we can rank order the populations of each country
reported that year. Note: by default, R ranks with the highest
number receiving the highest rank. Use desc to
reverse this.
gapminder <-
gapminder |>
group_by(year) |>
mutate(pop_rank = rank(population),
pop_rank_desc = rank(desc(population))) |>
ungroup()
gapminder |>
arrange(year, pop_rank) |>
select(year, population, pop_rank, pop_rank_desc)
## # A tibble: 3,313 × 4
## year population pop_rank pop_rank_desc
## <int> <int> <dbl> <dbl>
## 1 1950 65797 1 39
## 2 1950 142938 2 38
## 3 1950 295587 3 37
## 4 1950 866982 4 36
## 5 1950 961305 5 35
## 6 1950 1908310 6 34
## 7 1950 2218000 7 33
## 8 1950 2336432 8 32
## 9 1950 3265126 9 31
## 10 1950 3463446 10 30
## # ℹ 3,303 more rows
A simple conversion from categorical to numeric values is a count. In this case, we can look at the number of distinct countries within a continent.
gapminder |>
group_by(continent) |>
mutate(num_countries = n_distinct(country),
num_rows = n()) |>
select(continent, country, num_countries, num_rows)
## # A tibble: 3,313 × 4
## # Groups: continent [6]
## continent country num_countries num_rows
## <fct> <fct> <int> <int>
## 1 Africa Libya 53 637
## 2 Africa Uganda 53 637
## 3 Africa Algeria 53 637
## 4 Africa Angola 53 637
## 5 Africa Benin 53 637
## 6 Africa Botswana 53 637
## 7 Africa Burkina Faso 53 637
## 8 Africa Burundi 53 637
## 9 Africa Cameroon 53 637
## 10 Africa Cape Verde 53 637
## # ℹ 3,303 more rows
We can also take a continuous variable, say population,
and convert it into a categorical variable based on either predefined
groups or percentiles.
Starting with groups, let’s create a column that reflects whether or not a country’s population was above the median population (i.e., in the upper half) for that continent that year, or below the median (i.e., the lower half).
gapminder <-
gapminder |>
group_by(continent, year) |>
mutate(pop_median = median(population),
pop_half = ifelse(population >= median(population),
"upper half",
"lower half")) |>
ungroup()
# plotting just for a randomly selected year
plot_year <- sample(gapminder$year, 1)
gapminder |>
filter(year == plot_year) |>
ggplot() +
geom_point(mapping = aes(x = population,
y = continent,
colour = pop_half)) +
scale_x_log10(labels=\(x) paste(x/1e6, "M")) +
annotation_logticks(sides = 'b') +
scale_colour_brewer(palette = "Dark2") +
theme_hc() +
labs(title = paste("Population 50th percentiles for the year",
plot_year))
So, we can see that this is doing what we want it to be doing.
Alternatively, we can cut the population into multiple groups based
on some numbers (e.g., the order of magnitude or percentiles), or a
predefined number of equally sized groups. The different options we have
can be seen in all the cut_ … functions (see the Help
panel). For now, we’ll just use the built-in cut function
to break the populations up based on their order of magnitude.
How can we check the minimum and maximum orders of magnitude?
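One quick way (a sketch): the order of magnitude is just the base-10 logarithm of the population, so we can inspect its range.
# smallest and largest orders of magnitude of the population
range(log10(gapminder$population))
# or look at the raw extremes directly
range(gapminder$population)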
# breaks need to cover lowest and highest values
oom_breaks <- 10 ^ c(4, 5, 6, 7, 8, 9, 10)
# there should be one LESS label (these are "in between breaks")
oom_labels <- c("10-100K", "100K-1M", "1-10M", "10-100M", "100M-1B", ">1B")
gapminder <-
gapminder |>
mutate(pop_oom = cut(population, breaks = oom_breaks,
labels = oom_labels, right = TRUE))
gapminder |>
select(country, population, pop_oom) |>
sample_n(10)
## # A tibble: 10 × 3
## country population pop_oom
## <fct> <int> <fct>
## 1 Finland 4738902 1-10M
## 2 Italy 57179460 10-100M
## 3 Djibouti 71851 10-100K
## 4 United States 242803533 100M-1B
## 5 Latvia 2340205 1-10M
## 6 Morocco 25798239 10-100M
## 7 India 372000000 100M-1B
## 8 United Kingdom 60776238 10-100M
## 9 Iraq 7240260 1-10M
## 10 Japan 120754335 100M-1B
Use one of the ggplot2 cut_ functions to create a column
of data that breaks the countries into 5 equal groups based on
population:
"very small", "small", "middle sized", "large", "very large".
We can group our data based on country, and for each country, we can use the lag function in dplyr to calculate the change in GDP Per Capita from year to year. To do this, we’ll need to order_by year. Again, we need to group our data by country; otherwise, the first year of country B is going to be compared to the last year of country A.
gapminder |>
group_by(country) |>
mutate(gdp_lag_1 = lag(gdp_per_cap, n = 1, order_by = year)) |>
arrange(country, year) |>
select(country, year, gdp_per_cap, gdp_lag_1)
## # A tibble: 3,313 × 4
## # Groups: country [187]
## country year gdp_per_cap gdp_lag_1
## <fct> <int> <dbl> <dbl>
## 1 Afghanistan 1952 779. NA
## 2 Afghanistan 1957 821. 779.
## 3 Afghanistan 1962 853. 821.
## 4 Afghanistan 1967 836. 853.
## 5 Afghanistan 1972 740. 836.
## 6 Afghanistan 1977 786. 740.
## 7 Afghanistan 1982 978. 786.
## 8 Afghanistan 1987 852. 978.
## 9 Afghanistan 1992 649. 852.
## 10 Afghanistan 1997 635. 649.
## # ℹ 3,303 more rows
Notice here, the “lag” is capturing the “last” (or “previous”) value. Also, since the difference in years might not always be consistent for each country, we should log that information, too.
gapminder <-
gapminder |>
group_by(country) |>
mutate(gdp_diff = gdp_per_cap - lag(gdp_per_cap, n = 1,
order_by = year),
year_diff = year - lag(year, n = 1,
order_by = year)) |>
ungroup()
gapminder |>
arrange(country, year) |>
select(country, year, gdp_per_cap, gdp_diff, year_diff)
## # A tibble: 3,313 × 5
## country year gdp_per_cap gdp_diff year_diff
## <fct> <int> <dbl> <dbl> <int>
## 1 Afghanistan 1952 779. NA NA
## 2 Afghanistan 1957 821. 41.4 5
## 3 Afghanistan 1962 853. 32.2 5
## 4 Afghanistan 1967 836. -16.9 5
## 5 Afghanistan 1972 740. -96.2 5
## 6 Afghanistan 1977 786. 46.1 5
## 7 Afghanistan 1982 978. 192. 5
## 8 Afghanistan 1987 852. -126. 5
## 9 Afghanistan 1992 649. -203. 5
## 10 Afghanistan 1997 635. -14.0 5
## # ℹ 3,303 more rows
Notice how the lag between GDP observations is not always 5 years …
gapminder |>
count(year)
## # A tibble: 58 × 2
## year n
## <int> <int>
## 1 1950 39
## 2 1951 24
## 3 1952 144
## 4 1953 24
## 5 1954 24
## 6 1955 24
## 7 1956 24
## 8 1957 144
## 9 1958 25
## 10 1959 25
## # ℹ 48 more rows
A simple method for creating a new column was already used above, but we’ll formalize it here. Any column can be turned into a binary column by defining some sort of [this] and thus a [not this]. For example, we can simply determine whether a country is our home country or not:
gapminder <-
gapminder |>
mutate(is_us = country == "United States")
gapminder |>
count(is_us)
## # A tibble: 2 × 2
## is_us n
## <lgl> <int>
## 1 FALSE 3256
## 2 TRUE 57
There are tons of ways you can manipulate strings in R. But, we’ll use a simple example here by determining the number of “names” within a country name. E.g., the “United States” has 2 “names”, and “Afghanistan” only has 1.
# here, we apply the `length` function to each split string
ct <- map_int(str_split(gapminder$country, " "), length)
# note: the vector is the same, so order is conserved
gapminder |>
mutate(country_name_ct = ct) |>
count(country_name_ct)
## # A tibble: 4 × 2
## country_name_ct n
## <int> <int>
## 1 1 2724
## 2 2 467
## 3 3 98
## 4 4 24
Or, we can do this more efficiently using str_count to
count the number of spaces, and add 1.
gapminder <- gapminder |>
mutate(country_name_ct = str_count(country, " ") + 1)
gapminder |>
count(country_name_ct)
## # A tibble: 4 × 2
## country_name_ct n
## <dbl> <int>
## 1 1 2724
## 2 2 467
## 3 3 98
## 4 4 24
There are plenty of functions you can run on dates using the lubridate package. Since the Gapminder dataset does not contain a full date column (i.e., we only have the year column), we will leave it to the reader to explore possible ways to build columns using this package on your own data. As you create date columns, try to apply some of the methods described above.
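As a small, hypothetical sketch (since gapminder only has a year column, we first construct a date from it), here are a few lubridate functions you might apply:
# hypothetical sketch: build a date from `year`, then derive columns from it
gapminder |>
  mutate(year_start = make_date(year = year, month = 1, day = 1),
         obs_month  = month(year_start),                   # component extraction
         days_since = as.numeric(today() - year_start)) |> # deviation-style column
  select(year, year_start, obs_month, days_since)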
Recall that a parameter is an unknown value describing some attribute of a population or population distribution. When we “draw conclusions about a population”, we are estimating (inferring) parameter values based on our data. But, we could have selected some other data …
With over 500 rows of data, we should have enough to simulate other samples we could have collected via random sampling. Bootstrapping uses all of our data as a stand-in for the full population, and random resampling (with replacement) as a stand-in for repeating our data collection.
Take the life expectancy for all the countries in the 21st century:
# saving this vector for a bootstrapping example
life_exp_2000s <-
gapminder |>
filter(year >= 2000) |>
pluck("life_exp")
ggplot() +
geom_histogram(mapping = aes(x = life_exp_2000s),
colour='white') +
theme_hc()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This histogram looks quite smooth (i.e., there are no significant breaks in the data), and we have over 500 rows of data, so we are probably safe to bootstrap from it.
A bootstrapped sample is the size of the original dataset, done with replacement.
bootstrap <- function (x, func = mean, n_iter = 10^4) {
  # pre-allocate a vector to hold the statistic from each iteration
  func_values <- numeric(n_iter)
  # we simulate sampling `n_iter` times
  for (i in 1:n_iter) {
    # pull a sample of the same size as the data, with replacement
    x_sample <- sample(x, size = length(x), replace = TRUE)
    # store this iteration's value
    func_values[i] <- func(x_sample)
  }
  return(func_values)
}
life_exp_means <- bootstrap(life_exp_2000s)
ggplot() +
geom_histogram(mapping = aes(x = life_exp_means),
color='white') +
labs(title = "10K Bootstrapped Sample Means of `life_exp_2000s`",
subtitle = "A *simulation* of the true sampling distribution",
x = "Bootstrapped Sample Mean",
y = "Number of Samples") +
theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
What we see above is an estimation for the shape of our
sampling distribution. Each sample mean in life_exp_means
is represented in this histogram, and we can see that some sample means
are larger than others. The standard deviation of the
(true) sampling distribution is called the standard
error. It is a measure for the dispersion of sample means, and
informs the expected “error” of a sample mean against the true mean of
the population. We can estimate it by calculating the standard deviation
of our bootstrapped sample means, life_exp_means.
life_exp_se_boot <- sd(life_exp_means)
life_exp_se_boot
## [1] 0.4416655
If the probability distribution of the life_exp population were roughly normal, then we could estimate the standard error (of a sample mean) using the following formula:
\[ \text{SE}_\text{mean} = \frac{s}{\sqrt{n}} \]
where \(s\) is an estimation for the population standard deviation using the standard deviation of our sample of data, and \(n\) is the number of rows.
life_exp_se_est <- sd(life_exp_2000s) / sqrt(length(life_exp_2000s))
life_exp_se_est
## [1] 0.4439835
This formula is most reliable for calculating the standard error of the mean of a normally distributed data set. But we may want a standard error for some other statistic (e.g., median, max, etc.), or for data with some other distribution. So, in general, if you have sufficient data, use bootstrapping to calculate the standard error. Traditionally, over 100 rows is enough, but over 1K rows is even better.
Of course, we can use the bootstrapping method to estimate parameters other than the mean …
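For instance, here is a sketch using the same bootstrap() function defined above to estimate the standard error of the median (the life_exp_median_boot name is just for illustration):
# bootstrapped standard error of the sample median
life_exp_median_boot <- bootstrap(life_exp_2000s, func = median)
sd(life_exp_median_boot)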
Suppose we’re interested in the mean life expectancy over all countries in the world (i.e., where we assume each country in the “world” is represented in our data); this unknown value is our population parameter. Now, we do not know (nor will we ever know) this “true” mean for the population, so we should want to measure the uncertainty of our estimates, and confidence intervals are a commonly useful way to do this.
avg_life_exp <- mean(life_exp_2000s)
avg_life_exp
## [1] 70.37521
In fact, our data represents only one sample from the population (i.e., the life_exp for each row could have been slightly different based on a number of factors). Again, we don’t know the true sampling distribution, but we can propose a possible one:
# *rounded* mean and std dev from our sample of data
f_life_exp <- \(x) dnorm(x, mean = 70, sd = 0.4)
ggplot() +
geom_function(xlim = c(68.5, 71.5),
fun = f_life_exp) +
geom_segment(mapping = aes(x = avg_life_exp - 0.4,
y = f_life_exp(avg_life_exp),
xend = avg_life_exp + 0.4,
yend = f_life_exp(avg_life_exp),
linetype = "proposed interval"),
color = "gray") +
geom_point(mapping = aes(x = avg_life_exp,
y = f_life_exp(avg_life_exp),
color = "our sample"), size = 2) +
labs(title = "*Possible* Sampling Distribution for `life_exp_2000s`",
subtitle ='In this case, the "true" mean is 70 years',
x = "Sample Mean",
y = "Probability Density",
color = "",
linetype = "") +
theme_minimal()
Who knows. Maybe the sample mean we’re collecting is actually overestimating the true mean. We cannot know for sure. But, we can create an interval which we expect to contain the true mean with some defined level of confidence. So, whenever we draw a conclusion from a confidence interval, we assume our sample is representative of the population.
Recall, the standard error represents the standard deviation of the sampling distribution. Also, note that, by the Central Limit Theorem, the sampling distribution of the mean will be approximately normal (Gaussian) for sufficiently large samples. So, we know that the interval \((\mu - \text{SE}, \mu + \text{SE})\) should contain approximately 68.3% of all the sample means we could have calculated (possibly even our own), given true mean \(\mu\) and standard deviation \(\sigma=\text{SE}\).
# we could pick anything here ...
mu_ <- 2
se_ <- 4
z_ <- 1 # number of stdevs from the mean
# the percentage of values within k stdevs from the mean
pnorm(mu_ + z_ * se_, mu_, se_) - pnorm(mu_ - z_ * se_, mu_, se_)
## [1] 0.6826895
In other words, if we know ahead of time that the value \(z^*\) corresponds to the number of standard deviations from the mean which captures \(\mathcal{P}\) percent of the data, we can say:
\[ P(\mu - z^* \cdot \text{SE} < \bar{x} < \mu + z^* \cdot \text{SE}) = \mathcal{P} \]
for any sample mean \(\bar{x}\) from the sampling distribution with mean \(\mu\) and standard deviation \(\text{SE}\). Now, let’s convert our sampling distribution (and in turn, our sample mean \(\bar{x}\)) to the standard normal distribution. That is, subtract each element by \(\mu\) and divide by the estimated standard error, \(s/\sqrt{n} \approx \text{SE}\). Then, for a sufficiently large sample of data, we have
\[ \begin{align} P(-z^* < \frac{\bar{x} - \mu}{s/\sqrt{n}} < z^*) &\approx \mathcal{P} \\ \\ \to P(\bar{x} - z^*\frac{s}{\sqrt{n}} < \mu < \bar{x} + z^*\frac{s}{\sqrt{n}}) &\approx \mathcal{P} \end{align} \]
This interval \(\displaystyle\bar{x}\pm z^* \frac{s}{\sqrt{n}}\) represents a confidence interval:
A \(\mathcal{P}\)-percent C.I. is expected to include the true mean \(\mu\) of a distribution \(\mathcal{P}\)-percent of the time. That is, if we create the same 95% C.I. using 100 different samples of data from the same distribution, then 95 of them are expected to contain the population mean.
We call this \(z^*\) the “Z-score”, and it corresponds to the “critical value” which yields the \(\mathcal{P}\) percent of data described for the standard normal distribution. That is, to the right of \(z^*\), we should have \((1 - \mathcal{P})/2\) percent of the data. So,
P <- 0.95 # % confidence
# z score (we need (1 - P)/2 of data to the right/upper tail)
z_score <- qnorm(p=(1 - P)/2, lower.tail=FALSE)
ggplot() +
geom_function(xlim = c(-4, 4), fun = function(x) dnorm(x)) +
geom_vline(mapping = aes(xintercept = c(-z_score, z_score)),
color = "orange") +
geom_text(mapping = aes(x = -z_score - 0.1, y = .25),
label = "z-score (2.5%)", color = "orange", hjust = "right") +
geom_text(mapping = aes(x = z_score + 0.1, y = .25),
label = "z-score (97.5%)", color = "orange", hjust = "left") +
scale_x_continuous(breaks = -4:4) +
labs(title = "Standard Normal Curve",
x = "Standard Deviations from Mean 0",
y = "Probability Density") +
theme_minimal() +
theme(legend.position="none")
Again, \(\mathcal{P}\) percent of the data lies between the orange vertical lines.
We’ve seen above that it’s best to use the bootstrapped S.E. over the \(s/\sqrt{n}\) calculation. So we’ll use that in our confidence interval instead.
Also, you may encounter a smaller sample of data (e.g., fewer than 100 rows) which is too small to approximate samples from the normal distribution. In these cases, you’d use the “t-statistic”: the analogous “critical value”, but taken from the t-distribution, which has heavier tails for smaller samples and approaches the normal distribution for larger ones (in which case, the z-score works just as well).
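As a quick sketch of that point, compare the two critical values for a 95% interval at different sample sizes:
# critical values for a 95% interval
qnorm(0.975)         # z-score, ~1.96
qt(0.975, df = 9)    # t-statistic with only 10 observations, noticeably larger (~2.26)
qt(0.975, df = 999)  # with ~1000 observations, essentially the z-score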
Let’s compose a C.I. for the life expectancy based on our estimated standard error. Keep in mind, we want the middle \(\mathcal{P}\) of the distribution, leaving \((1 - \mathcal{P})/2\) in each tail, so
P <- .95 # % confidence
n <- length(life_exp_2000s)
# t-statistic with n-1 degrees of freedom
t_star <- qt(p = (1 - P)/2, df=n - 1, lower.tail=FALSE)
# t half-width
CI_t <- t_star * life_exp_se_boot
# z-score
z_score <- qnorm(p=(1 - P)/2, lower.tail=FALSE)
# z half-width
CI_z <- z_score * life_exp_se_boot
c(avg_life_exp - CI_t, avg_life_exp + CI_t) # t-distribution
## [1] 69.50765 71.24278
c(avg_life_exp - CI_z, avg_life_exp + CI_z) # normal distribution
## [1] 69.50956 71.24086
It makes sense that these are very close, given that our data is not small (i.e., well over 100 rows).
Interpret the above results. How would you explain either confidence interval in your own words? Hint: See below for some inspiration …
# for the sake of example, get 100 means and their standard error
boot_n <- 100
boot_means <- bootstrap(life_exp_2000s, n_iter = boot_n)
boot_se <- sd(boot_means)
true_mean <- mean(life_exp_2000s)
t_star <- qt(p = (1 - P)/2, df=boot_n - 1, lower.tail=FALSE)
lowers <- boot_means - t_star * boot_se
uppers <- boot_means + t_star * boot_se
mask <- (lowers < true_mean) & (true_mean < uppers)
ggplot() +
geom_segment(mapping = aes(x = lowers, xend = uppers,
y = 1:length(boot_means),
yend = 1:length(boot_means),
color = mask)) +
geom_vline(xintercept = true_mean, color = "orange") +
scale_color_manual(values = c("darkgray", "red"),
breaks = c(TRUE, FALSE)) +
theme_minimal() +
theme(legend.position="none") +
labs(title = "100 Confidence Intervals for `life_exp`",
x = "CI boundaries",
y = "Iteration")
One particular issue with using either of these parametric confidence intervals (e.g., “t-” or normal) is that they force us to make an explicit estimate of a population parameter (i.e., the standard error), which we do not have access to. Further, they confine us to C.I.s for the sample mean only, which might not be the parameter we’re interested in. So, instead, we can use bootstrapping to completely define our confidence intervals.
library(boot)
We’ll use the boot library to build a function which calculates a percentile confidence interval.
# we use a "_" so we don't overwrite the original function
boot_ci <- function (v, func = median, conf = 0.95, n_iter = 1000) {
# the `boot` library needs the function in this format
boot_func <- \(x, i) func(x[i])
b <- boot(v, boot_func, R = n_iter)
boot.ci(b, conf = conf, type = "perc")
}
For example, we can compare the results of this function to the mean bootstrap we calculated above:
boot_ci(life_exp_2000s, mean, 0.95)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = b, conf = conf, type = "perc")
##
## Intervals :
## Level Percentile
## 95% (69.53, 71.27 )
## Calculations and Intervals on Original Scale
(They look pretty similar.) Or, we can calculate a new one, e.g., for the median:
boot_ci(life_exp_2000s, median, 0.95)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = b, conf = conf, type = "perc")
##
## Intervals :
## Level Percentile
## 95% (72.47, 74.24 )
## Calculations and Intervals on Original Scale
Here, again, we simulate pulling \(N\) samples from the population by bootstrapping \(N\) times from our data set, with replacement. For each bootstrap sample, we calculate the function value (e.g., mean, median, etc.); the percentile interval then simply takes the \((1 - \mathcal{P})/2\) and \((1 + \mathcal{P})/2\) quantiles of those bootstrapped values as its bounds, with no t-statistic or z-score required.
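As a quick sketch of that idea, the 95% percentile bounds can be read directly off the bootstrapped sample means we computed earlier (the exact numbers will vary from run to run):
# the middle 95% of the bootstrapped sample means
quantile(life_exp_means, probs = c(0.025, 0.975))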
So far, we’ve been interested only in one variable, doing our best to understand that single variable based on our data. However, it’s more likely that we want to understand how variables are related to one another, and eventually how one affects another. To start to understand how two variables relate to one another, we need to understand how they vary together.
For one variable, variance measures how much each point deviates from the mean. For two variables, covariance measures how much pairs of points deviate from their means together. It is concerned with any of the following:
- the two variables tending to increase or decrease together (cov > 0)
- one variable tending to increase while the other decreases (cov < 0)
- no linear relationship between the two (cov = 0)
\[ \text{cov}(X, Y) = \sigma_{X,Y} = E[(X - \mu_{X})(Y - \mu_{Y})] \\ \text{sample calculation} \to \frac{\sum_{i=1}^{n} (x_{i} - \bar{x})(y_{i} - \bar{y})}{n - 1} \]
Note: R divides by \(n-1\) (Bessel’s correction) in the calculation to obtain an unbiased estimate (i.e., imagine what this would be for one value of \(X\) and \(Y\) …)
The covariance is the expected value for the amount to which a variable has deviated from its mean, scaled by the amount to which another variable has deviated from its mean in the same direction. It is a measure of linear dependence between two random variables.
w <- rep(1, 10)
x <- 1:10
y <- 10:1
z <- rep(c(1, 2), 5)
# explore different combinations of these
cov(x, y)
## [1] -9.166667
For example, we can start by looking at the relationship between population and life expectancy for the country Portugal.
plot_data <-
gapminder |>
filter(country == "Portugal")
plot_data |>
ggplot() +
geom_point(mapping = aes(x = population, y = life_exp)) +
labs(x = "Population", y = "Life Expectancy",
title = "Life Expectancy as Population Increases",
subtitle = paste("Covariance:",
round(cov(plot_data$population,
plot_data$life_exp), 2))) +
theme_minimal()
When we compare this covariance value with the covariance of years_since and pop_rank, all we can really determine is that one is negative and the other positive; the magnitudes sit on such different scales that they are hard to compare directly.
round(cov(plot_data$years_since, plot_data$pop_rank), 2)
## [1] -54.24
This illustrates a critical issue: the units of the covariance cannot easily be tied to anything useful (even if the sign can).
However, it can be shown that the magnitude of the covariance is at most the product of the standard deviations of the two variables. This is a result of the Cauchy-Schwarz Inequality.
\[ |E[(X - \mu_{X})(Y - \mu_{Y})]| \leq \sigma_X\sigma_Y \]
So, we can use this to scale our covariance to calculate a “correlation coefficient”.
When we divide the covariance by the product of standard deviations for \(X\) and \(Y\), we have Pearson’s Correlation Coefficient. This is a normalized version of the covariance, and it measures the strength of the linear relationship between the variables.
\[ \text{corr}(X, Y) = \rho_{X,Y} = \frac{E[(X - \mu_{X})(Y - \mu_{Y})]}{\sigma_X\sigma_Y} \\ \text{sample calculation} \to r_{x,y} = \frac{\sum_{i=1}^{n} (x_{i} - \bar{x})(y_{i} - \bar{y})}{(n-1)s_x s_y} \]
round(cor(plot_data$population, plot_data$life_exp), 2)
## [1] 0.97
round(cor(plot_data$years_since, plot_data$pop_rank), 2)
## [1] -0.08
Here, we can see that the linear relationship between population and life_exp is much stronger than the linear relationship between years_since and pop_rank. We also get a clearer picture of the direction (sign) of each relationship.
Pearson’s correlation coefficient requires continuous values, where the operations in the formula make sense. But for discrete values with an order (e.g., “small”, “medium”, “large”), where “increasing/decreasing together” is still meaningful, we would still like a measure of correlation. In these cases, we can use the ranks of the values to calculate Spearman’s Rho, or Spearman’s Rank Correlation.
\[ \rho_{R(X),R(Y)} = \frac{\text{cov}(\text{R}(X), \text{R}(Y))}{\sigma_{R(X)}\sigma_{R(Y)}} \\ \text{sample calculation} \to r_{R(x),R(y)} = \frac{\text{cov}(\text{R}(x), \text{R}(y))}{s_{R(x)}s_{R(y)}} \]
Where \(R\) represents the rank order of the discrete variable. For example, we can use Spearman’s Rho on the pop_oom column we made earlier, which contains string values representing the order of magnitude for the population.
gapminder |>
mutate(pop_oom_rank = as.numeric(pop_oom)) |>
select(pop_oom, pop_oom_rank) |>
unique() |>
arrange(pop_oom)
## # A tibble: 6 × 2
## pop_oom pop_oom_rank
## <fct> <dbl>
## 1 10-100K 1
## 2 100K-1M 2
## 3 1-10M 3
## 4 10-100M 4
## 5 100M-1B 5
## 6 >1B 6
To calculate Spearman’s Rho between this value and country_name_ct, we use as.numeric.
# we use `as.numeric` to get the factor level rank
s_rho <- cor(x = as.numeric(gapminder$pop_oom),
y = gapminder$country_name_ct,
method = "spearman")
s_rho
## [1] -0.06403513
Note: R encodes the result of the cut_ functions as factor levels with an order, and the same holds for TRUE/FALSE and integer values. Otherwise, you may need to use factor or as_factor to set the level order of your discrete variable as you like before using as.numeric.
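As a small, hypothetical illustration of that note, you can set the level order yourself before converting to numeric ranks:
# hypothetical example: control the level order before ranking
sizes <- c("medium", "small", "large", "small")
sizes_fct <- factor(sizes, levels = c("small", "medium", "large"))
as.numeric(sizes_fct)  # 2 1 3 1 -- numeric codes follow the level order we set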
gapminder |>
group_by(pop_oom, country_name_ct) |>
count() |>
ggplot() +
geom_point(mapping = aes(x = pop_oom, y = country_name_ct, size=n)) +
labs(x = "Population Order of Magnitude",
y = "Country Name Length (in words)",
title = "Length of Country Name vs. Population",
subtitle = paste("Spearman's Rho = ", round(s_rho, 2)),
size = "Number of Rows") +
theme_minimal()
Spearman’s Rho does capture a slight decreasing trend here, but the trend is very small, and it might not be worth mentioning given that the bulk of the data does not reflect it.
As we’ve seen, two variables can be interesting to investigate on their own, but it’s also interesting to investigate how one affects the other. In these cases, we should designate a response variable which we expect to be affected by an explanatory variable. You may also see these described in other terms:
| Explanatory | Response |
|---|---|
| Independent | Dependent |
| Exogenous | Endogenous |
| Feature | Target |
A few examples of these could be a student’s age (explanatory) and their class performance (response), or the year (explanatory) and a country’s population (response). In other words, it should make sense for an explanatory variable to affect a response variable, but not the other way around. E.g., a student’s class performance doesn’t affect their age, but the converse makes sense. When plotting explanatory and response variables, the convention is to place the explanatory variable on the x-axis and the response variable on the y-axis:
plot_data <-
gapminder |>
filter(country == "Portugal")
plot_data |>
ggplot() +
geom_line(mapping = aes(x = year, y = population), color='lightblue') +
geom_point(mapping = aes(x = year, y = population), color='darkblue') +
scale_y_continuous(labels = \(x) paste(x / 10^6, "M")) +
labs(x = "Year", y = "Population (in millions)",
title = "Population in Portugal, Over Time",
subtitle = paste("Correlation:",
round(cor(plot_data$year,
plot_data$population), 2),
" Observations:",
count(plot_data))) +
theme_hc()
Note: with time on the x-axis, it’s usually a good practice to connect points using a line plot. If there are few enough data points, as there are here, indicating them with geom_point is also recommended. We will discuss Time Series more in a later lesson …
Here, we see a positive trend between Population and Year for the country of Portugal. It makes sense for the year to affect the population, but of course the population cannot affect the year. We only have 58 data points, but the correlation coefficient reflects just how strong the linear trend on the plot is. Notice how the slight dip between the 60s and 70s doesn’t affect the strength of the correlation much, given how linear the rest of the data is.
Select a couple of explanatory and response variables from the updated gapminder dataset, plot them, and calculate their correlations. Determine whether the relationships are strong or not, and explain your reasoning.
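One possible approach (a sketch): treat GDP per capita as the explanatory variable and life expectancy as the response for a single year (2007 is just an example), then compute both Pearson’s and Spearman’s correlations.
# a sketch: GDP per capita (explanatory) vs. life expectancy (response)
plot_data <- gapminder |> filter(year == 2007)
plot_data |>
  ggplot() +
  geom_point(mapping = aes(x = gdp_per_cap, y = life_exp)) +
  scale_x_log10() +
  labs(x = "GDP per Capita (log scale)", y = "Life Expectancy",
       title = "Life Expectancy vs. GDP per Capita (2007)") +
  theme_minimal()
round(cor(plot_data$gdp_per_cap, plot_data$life_exp), 2)  # Pearson
round(cor(plot_data$gdp_per_cap, plot_data$life_exp, method = "spearman"), 2)  # Spearman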