Welcome to Part 5 of my Stats in R Tutorial series. If you missed the last tutorial, you can check it out at the following link. Now that we have covered a bit of the tidyverse and some basic descriptive stats, we can move on to what is the core of statistics, which is inferential statistics. We will also learn a number of other useful general functions in R. By the end of this tutorial you will be able to:
replicate
function to repeat a function several
times.abline
function in base
R.Previously, we discussed descriptive statistics, which simply summarize the characteristics of a dataset. As an example, the IQR determines where 50% of the data of a given variable lies (as well as the 25% and 75% quartile specifications of that range). Inferential statistics are used to test hypotheses and make generalizations (or at least some probabilistic guesses) of the population at large. Examples of inferential statistics include regressions, chi-squared tests, and the topic of today: z-tests.
One must understand the concept of standard error and how z-scores relate to that, so we will briefly revisit these concepts before moving to z-tests. Remember that a population is an entire group you want to draw inferences about (for example, all of the world’s smokers) and a sample is a specific group you draw data from that is supposed to represent this population (a small group of smokers selected at random). To illustrate this, we can simulate some data again, as shown in some of my previous tutorials. This time, we will simulate the IQ of people from a given country. There are more than one million people in this country, and their IQ represents the average, with IQ score means of 100 and a standard deviation of 15.
#### Simulate Population Data ####
set.seed(123)
population <- rnorm(n = 1000000,
mean = 100,
sd = 15)
Now we can draw a histogram as always to see what this looks like. Remember that the parameters are the known values from a population, such as the mean and standard deviation. Here we know that \(\mu = 100\) and \(\sigma = 15\). With this knowledge, we should be able to at least visually inspect that this is the case here in our histogram.
#### Plot Population Data ####
hist(population,
main = "Histogram of Population IQ",
xlab = "IQ Score",
col = "steelblue")
We may want to be more exact about this. Let’s use a new function
called abline
. This function draws lines on your plot,
which include horizontal, vertical, and even regression lines (which is
where the “ab” part comes from). For now, we will draw a vertical line
with the v
argument, color the line red with
col
, and change the linewidth with lwd
. Note
that when you use abline
multiple times on the same plot,
they will overlap each other, so you will need to be mindful of this if
you are using multiple lines. For now, we will simply supply the mean of
the population
as the x-axis location where the line will
be drawn.
#### Plot Population Data ####
hist(population,
main = "Histogram of Population IQ",
xlab = "IQ Score",
col = "steelblue")
#### Draw Abline ####
abline(v=mean(population),
col="red",
lwd=4)
Now that we have the power of the tidyverse at our hands, we can
recreate this and customize it to our liking. Let’s go ahead and do
that. Since population
is a vector, we will first transform
it into a data frame before plotting, otherwise ggplot
will
get confused since it is used to dealing with data frames.
library(tidyverse)
data.frame(population) %>%
ggplot(aes(x=population))+
geom_histogram(color = "black",
fill = "steelblue")+
labs(x="IQ Scores",
y="Count",
title = "Population IQ Scores (n = 1,000,000)")+
geom_vline(aes(xintercept = mean(population)),
color = "red",
linewidth = 2)+
theme_bw()
As expected, the line is located at exactly 100 along the x-axis. The
bins may look slightly different, but that’s just because the defaults
for base R and ggplot
are different. The overall shape is
still relatively the same.
Remember that we can also sample from data randomly with the
sample
function and that these given samples will vary from
the true population mean and standard deviation. To illustrate that, we
can go ahead and run 10 random samples of this population data like
shown in previous tutorials, however, you will notice below that
this is a lot of code.
#### Create Samples ####
sample.1 <- sample(population,
size = 50,
replace = T)
sample.2 <- sample(population,
size = 50,
replace = T)
sample.3 <- sample(population,
size = 50,
replace = T)
sample.4 <- sample(population,
size = 50,
replace = T)
sample.5 <- sample(population,
size = 50,
replace = T)
sample.6 <- sample(population,
size = 50,
replace = T)
sample.7 <- sample(population,
size = 50,
replace = T)
sample.8 <- sample(population,
size = 50,
replace = T)
sample.9 <- sample(population,
size = 50,
replace = T)
sample.10 <- sample(population,
size = 50,
replace = T)
This is a pain in the butt to type and just makes your R script obnoxiously long. I will make a quick detour for now about why you shouldn’t do this.
There are two themes I will highlight from here: making your code more readable and making your code simpler. First, if your code has many functions within functions, like this:
data.frame(sample(rnorm(n=length(x),mean=mean(x),sd=sd(x))))
…it looks terrible and is difficult to read. Rather than trying to
read the equivalent of a code salad, my personal preference is to create
a new line for any new argument within a function (when useful).
Basically, if you have a very verbose function within a function, use
open brackets so it is easier to read. For example, you can start a
data.frame
function like so:
data.frame(x = rnorm(n=1000))
Or like this:
data.frame(
x = rnorm(n=1000)
)
The second version can be created automatically by hitting enter
after the first parentheses. Thereafter, we only include one argument in
the “body” and separate that with spaces. We can clearly see now that
data frame is the main function, and rnorm
is a sub-level
function within this main function. This makes it easier to read line by
line if we have something like this:
data.frame(
x = rnorm(n=1000,
mean=5,
sd=2),
y = rnorm(n=1000,
mean=10,
sd=20),
z = rnorm(n=1000,
mean=700,
sd=200)
)
Here we can clearly see that x, y, and z are just three variables
created with the rnorm
function, with each argument within
that function in an easy-to-read format because the arguments are spaced
on different lines.
Anyways, a second point I want to make is that you should make your code as simple and quick as humanly possible. For example, why make that ten-piece code we made earlier when we can smash it all together into one function? If there is ever a way we can cut down typing time, we should always execute it.
One way we can quickly automate this process is by using the
replicate
function, which simply takes each sample we
wanted earlier and repeats it 10 times. The structure of this is
basically:
Here I have also opened the data.frame
bracket so it
makes the function easier to read. We save the replicated function here
into an object called df
. You can inspect the head and see
it has given each sample a generic name of “X”. To make sure you have
the right placement of parentheses, you can always click just to the
right of a parentheses and it will highlight the function it belongs
to.
df <- data.frame(
replicate(
n = 10,
sample(
population,
size = 50,
replace = T
)
)
)
head(df)
## X1 X2 X3 X4 X5 X6 X7
## 1 105.8404 96.38619 94.52429 98.49531 114.67532 80.27814 119.05657
## 2 102.2202 93.80960 109.60829 96.18276 94.00549 101.94919 84.51748
## 3 120.8653 107.82926 91.30786 131.33691 76.81899 102.52536 85.09278
## 4 115.1366 97.66699 139.40884 81.80746 120.09119 93.40466 99.08544
## 5 110.1159 67.34482 117.97518 103.67674 79.92993 106.48351 115.85482
## 6 128.6480 105.12500 94.27927 71.53071 87.02698 119.58307 116.23702
## X8 X9 X10
## 1 101.51501 99.60230 97.28090
## 2 80.05765 96.15028 120.82124
## 3 109.93261 118.96527 114.56012
## 4 75.76042 119.11347 90.69073
## 5 94.40234 92.75085 97.48635
## 6 129.59019 111.29559 73.20283
Now we can make a histogram of each sample that we made. For example,
if we just want Sample 1 (labeled X1), we just need to call it in
ggplot
.
df %>%
ggplot(aes(x=X1))+
geom_histogram(color = "black",
fill = "steelblue")+
labs(x="IQ Scores",
y="Count",
title = "Sample 1 IQ Scores (n = 50)")+
geom_vline(aes(xintercept=mean(X1)),
color = "red",
linewidth = 2)+
theme_bw()
This is nothing new, as we explored this in Tutorial 3. The sample
mean \(\bar{x}\) is slighty different
from the population mean \(\mu\). But
how does this compare to the 9 other samples we drew? Surely there is
variation among the samples. We will use pivot_longer
again
to convert the scores into a general factor (the sample name) and their
respective values. We can then inspect the values to see if there is any
variation.
#### Get By-Sample Mean/SD ####
grouped.df <- df %>%
pivot_longer(cols = contains("X"),
names_to = "Sample",
values_to = "Value") %>%
group_by(Sample) %>%
mutate(Mean = mean(Value),
SD = sd(Value))
#### Inspect ####
grouped.df
## # A tibble: 500 × 4
## # Groups: Sample [10]
## Sample Value Mean SD
## <chr> <dbl> <dbl> <dbl>
## 1 X1 106. 98.5 15.3
## 2 X2 96.4 100. 12.6
## 3 X3 94.5 99.3 16.4
## 4 X4 98.5 103. 17.2
## 5 X5 115. 100. 13.7
## 6 X6 80.3 99.6 15.3
## 7 X7 119. 101. 13.5
## 8 X8 102. 95.2 14.2
## 9 X9 99.6 103. 13.5
## 10 X10 97.3 100. 13.9
## # … with 490 more rows
Just from this table, it seems that there is definitely variation in both the means and standard deviation. What if we simply obtained the mean and standard deviation without grouping them by sample first?
df %>%
pivot_longer(cols = contains("X"),
names_to = "Sample",
values_to = "Value") %>%
summarise(Mean_Value = mean(Value),
SD_Value = sd(Value))
## # A tibble: 1 × 2
## Mean_Value SD_Value
## <dbl> <dbl>
## 1 100. 14.6
You can see they are almost exactly the same as the population
values! To show how the sample values deviate from the population
values, let’s create a faceted ggplot
that we used in our
last tutorial. We will do this using the following procedure:
grouped.df
object we already saved and pipe it
into ggplot
.geom_vline
. One is colored
red for each sample’s mean. Another is colored orange for the population
mean.Remember to put the xintercept
argument into the
aes
function, as we are telling the plot to use the data we
assign to the line.
grouped.df %>%
ggplot(aes(x=Value))+
geom_histogram(color = "black",
fill = "steelblue")+
labs(x="IQ Scores",
y="Count",
title = "Sample IQ Scores")+
theme_bw()+
facet_wrap(~Sample,
nrow = 5,
ncol = 2)+
geom_vline(aes(xintercept = Mean),
color = "red",
linewidth = 1)+
geom_vline(aes(xintercept = 100),
color = "orange",
linewidth = 1)
You can see that the orange line is always at 100. However, two things are immediately present. First, the sample mean almost always deviates from the population mean. Second, this deviation is not very high. If our samples were not very representative, this would cause some major issues, as the distance between the red and orange lines would be far, and this would make it difficult to make any inferences about the samples, as they wouldn’t be good at generalizing to the population they were supposed to draw from.
This natural deviation between samples is called standard error. It has a formula, shown below:
\[ \text{SE} = \frac{\sigma}{\sqrt{n}} \]
where \(\sigma\) equals the standard
deviation of the sample, and \(n\) is
the sample size. We can calculate this directly for one of our samples
by simply selecting a sample (here X1) and creating a summarized
variable called SE
with the above formula.
se <- df %>%
select(X1) %>%
summarise(SE = sd(X1)/(sqrt(50)))
se
## SE
## 1 2.168775
Looking at our above plots and the table showing the different means for each sample, this number seems correct. Their respective deviation from the population mean is not much, around 2.33. We can also obtain the standard deviation from the standard error with the following formula:
\[ \sqrt{n}\times\text{SE} \]
#### Obtain Standard Deviation From SE ####
n <- 50
sd <- sqrt(n)*se
sd
## SE
## 1 15.33555
We can see that this value isn’t perfect, but given we only sampled one group of 50 participants, this number is pretty close to population \(\sigma\). With more values in a sample, this number would eventually approach 15.
This is essentially the basis for our z-tests, which we explore in the next section. While it is quite rare in actual practice, knowing the population mean and standard deviation gives us a generalized idea of what samples that come from it look like. But because these samples will naturally fluctuate, and because some samples are less representative than others, testing how different a sample is from a population speaks a lot to how extreme that sample is, based off it’s mean and standard error.
Now that we know about z-scores from previous tutorials and standard error from this tutorial, we can learn about the z-test. Remember that z-scores are the number of standard deviations a raw score is from the mean. They also represent raw score percentiles. If a given raw score has a corresponding z-score of \(+3.00\), this score is quite high, as 99.9% of scores are below this score in a normal distribution. A z-score of \(-1.00\) would mean that only around 16% of the distribution is below this score. A sample distribution and it’s percentiles can be seen below. Here we plot the cumulative density function (CDF), or the probability that a random variable \(x\) will take a value less than or equal to the random variable. I use z-scores below to demonstrate that the scale of the data does not matter in obtaining the CDF.
#### Transform Raw Scores to Z and Percentile ####
z.to.p <- grouped.df %>%
mutate(Scale_X = scale(Value),
Prob_X = pnorm(Scale_X))
#### Plot Them ####
z.to.p %>%
ggplot(aes(x=Scale_X,
y=Prob_X*100))+
geom_line(color = "green3",
linewidth = 3)+
geom_point(size = 2,
color = "darkgreen")+
labs(x="Z-Score",
y="Percentile",
title = "Cumulative Density Function (Mu = 0, Sigma = 1)")+
theme_bw()+
scale_y_continuous(n.breaks = 30)+
scale_x_continuous(n.breaks = 10)
As z-scores approach the far right region of the plot, the percentiles are typically very high, whereas the opposite is true for negative z-scores. The middle point of values lies in the z-score of 0 which is approximately around the 50 percentile mark. This indicates that the higher a z-score, the less likely scores will be above it, and the lower a z-score, the less likely scores will be below it. In other words, a z-score simply measures the extremity of a value’s probabilty of existence.
Knowing z-scores can consequently tell you how “extreme” a score is. If we get a score that has a low probability of occurrence, we can probably guess that this isn’t normal. Additionally, if we know both a given population mean and given sample mean, along with it’s corresponding standard error, we can test whether or not a sample is very extreme compared to what the population should be. This is where a z-test, and most other inferential tests for that matter, come in quite handy.
The formal equation for a z-test is shown below
\[ z = \frac{\bar{x}-\mu}{\sigma_{\bar{x}}} \]
where \(\bar{x}\) equals the mean of \(x\) and \(\sigma_{\bar{x}}\) is the standard error between \(\bar{x}\) and \(\mu\). If this is confusing, just remember that we already calculated error earlier, and that essentially gets plugged here. Just as a reminder, here is the SE formula, with \(\sigma_{\bar{x}}\) representing SE:
\[ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \]
The \(z\) score obtained from the other formula is a bit different from what we have calculated before in earlier tutorials, as it is an estimation of a critical z-score necessary for testing. However, the idea is essentially the same as a normal z-score, in that it represents a standardized score based off mean centering.
A practical scenario may be helpful. Let’s say we are doing an experiment to test if a new medication is safe for reducing hypertension. We need to draw up to competing hypotheses for our z-test. One is the null hypothesis, which stipulates that there is no difference between our observed scores and the population scores. The alternative hypothesis states a contrary idea, in that there is a difference in the observed and population scores. Our z-test will only apply the null hypothesis. We set up a test to basically say “we should find no difference between these scores”. If the z-score ends up being so extreme that it’s probability of occurrence is low, we should reject the null hypothesis. Importantly, this does not mean we accept the alternative hypothesis, only that we have insufficient evidence to support the null.
With this in mind, let’s say we know the average blood pressure of the world population is supposed to be around 120 (I’m just guessing here so don’t take this number too seriously). We then find a sample of 71 participants and administer a drug treatment trial and then check their blood pressure thereafter. Their mean blood pressure is around 137 bpm, and this fluctuated about 40 bpm.
From this information, we know:
Instead of simulating a bunch of data right away, we can just test these values by plugging them into our formulas from before. First, let’s save all of this information as objects in R.
#### Calculate Stats ####
sigma <- 40
n <- 71
mu <- 120
m <- 137
se <- sigma/sqrt(n)
se
## [1] 4.747127
Then we just derive z with our previous formula.
#### Derive Z Value ####
z <- (m-mu)/se
z
## [1] 3.581114
Finally we have two options. The first is to use a z-table to derive
what our percentile is given our z-score at this link. Alternatively, we can
also just use the pnorm
function to get the associated
percentile, then subtract 1 to get the p value, or the probability that
a score would exceed this value.
#### Derive Associated P Value ####
p <- 1-pnorm(z)
p
## [1] 0.0001710664
The value is extremely low…so low that we can conclude that our raw score is very high compared to what we expect if the sample and population means are equal. Less than .0002 percent of the normal population would have a blood pressure around this range. Given our z-score is also positive, this means that it is dangerously high…this medication may not be safe for human consumption. We would reject the null hypothesis (we cannot say there is no difference between the population mean and the sample mean given it’s extremity) and would discontinue this drug.
However, that was a lot of work to get a z-test. What if we created
our own function to make the process quicker? We can do this! Recall
before that a function just takes our assigned inputs and runs them how
we already defined them. This is where saving objects becomes more
helpful. By saving objects into a function, you can store the values and
use them in a successive way that can be very helpful. The
cat
function prints a message by combining text strings
(such as “z value”) and objects (which get printed next to these
strings).
z.test <- function(sigma,n,mu,m){
se <- sigma/sqrt(n)
z <- (m-mu)/se
p <- 1-pnorm(z)
cat("z value:", z, "p value:", p)
}
Then we just plug in our values from before.
z.test(sigma = 40,
n = 70,
mu = 120,
m = 137)
## z value: 3.555805 p value: 0.0001884117
Alternatively, we can just use our previously saved objects and use them as our inputs.
z.test(sigma = sigma,
n = n,
mu = mu,
m = m)
## z value: 3.581114 p value: 0.0001710664
Since our arguments are implicit, we can even just include our inputs without specifying their arguments.
z.test(sigma,
n,
mu,
m)
## z value: 3.581114 p value: 0.0001710664
Either way, we now have our custom function that makes the process simpler.
Now that we have covered a lot of ground there, it may help to
visualize all of this. First, let’s visualize what a statistically
significant z-score is. Usually a z-score of around 1.96 is considered
statistically significant, as it’s associated value is at the two
extreme ends of a normal distribution. To highlight this, we can plot
what this should look like using the nhstplot
package.
First we need to install as always.
install.packages("nhstplot")
After loading the library, we can call the plotztest
function, which allows you to input any z-score you please and where
it’s associated percentile range is. For our 1.96 cutoff criterion, our
plot will look like this:
library(nhstplot)
plotztest(z = 1.96)
This plot shows a normal distribution with two highlighted regions. If a z-score is lower than -1.96 or higher than 1.96, it should fall within the red regions, indicating it is much higher or lower on average than other scores. So if our associated z-score is exactly 1.96, we can assume that only 0.025% of any given sample within a population would be above this score, indicating it’s extremity. Many inferential tests use this criterion, which if added together from both sides, equals .05%. This is the typical alpha region used to test statistical significance. But what of our z-score that we obtained?
plotztest(z = 3.581114)
You can see that the red regions are almost nonexistent now. This is because, as we discussed before, less than .001 percent of the sample scores in this range would have this value. So if the participants in our study have a mean blood pressure this high, they may be in serious danger if they take the medication from our study.
This was a lot, so I recommend going through all of this again to make sure you understand the content. Having a good understanding of z-tests will help a lot with understanding other inferential tests, particularly t-tests. This is coincidentally the subject of the next tutorial, so stay tuned for an upcoming tutorial!
Thank you for reading this tutorial. If you felt that any part of this tutorial was helpful, and you would like to support me, please consider buying me a cup of coffee.
Check out my other RPubs if you want to learn more about stats in R.
Thanks for reading this tutorial. As mentioned previously, we will likely cover t-tests next time, which are tests that many people are accustomed to using and may be more exciting to learn. Take care until then and happy coding!