Introduction

Welcome to Part 5 of my Stats in R Tutorial series. If you missed the last tutorial, you can check it out at the following link. Now that we have covered a bit of the tidyverse and some basic descriptive statistics, we can move on to the core of statistics: inferential statistics. We will also learn a number of other useful general functions in R. By the end of this tutorial, you will be able to simulate population data, calculate and interpret standard error, and run a z-test both by hand and with a custom function.

Inferential Statistics

Previously, we discussed descriptive statistics, which simply summarize the characteristics of a dataset. As an example, the IQR tells us where the middle 50% of a variable's values lie (the range between the 25th and 75th percentiles). Inferential statistics, by contrast, are used to test hypotheses and make generalizations (or at least probabilistic guesses) about the population at large. Examples of inferential statistics include regressions, chi-squared tests, and the topic of today: z-tests.
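As a quick refresher, base R can compute these descriptive quantities directly. Below is a minimal sketch using simulated scores (the variable x here is purely illustrative and not part of this tutorial's data):

#### Quick Descriptive Refresher: IQR ####
x <- rnorm(n = 100, mean = 100, sd = 15)
quantile(x, probs = c(.25, .75)) # the quartiles bounding the middle 50%
IQR(x)                           # the width of that middle 50%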

Standard Error

Before moving to z-tests, one must understand the concept of standard error and how z-scores relate to it, so we will briefly revisit these concepts. Remember that a population is the entire group you want to draw inferences about (for example, all of the world’s smokers) and a sample is a specific group you draw data from that is supposed to represent this population (a small group of smokers selected at random). To illustrate this, we can simulate some data again, as shown in some of my previous tutorials. This time, we will simulate the IQ scores of one million people from a given country, normally distributed with a mean of 100 and a standard deviation of 15.

#### Simulate Population Data ####
set.seed(123)
population <- rnorm(n = 1000000,
                    mean = 100,
                    sd = 15)

Now we can draw a histogram as always to see what this looks like. Remember that parameters are the known values of a population, such as its mean and standard deviation. Here we know that \(\mu = 100\) and \(\sigma = 15\), so we should be able to at least visually confirm that this is the case in our histogram.

#### Plot Population Data ####
hist(population,
     main = "Histogram of Population IQ",
     xlab = "IQ Score",
     col = "steelblue")

We may want to be more exact about this. Let’s use a new function called abline. This function draws lines on your plot, including horizontal, vertical, and even regression lines (the “a” and “b” refer to a line’s intercept and slope, which is where the name comes from). For now, we will draw a vertical line with the v argument, color the line red with col, and change the line width with lwd. Note that when you call abline multiple times on the same plot, the lines are drawn on top of each other, so be mindful of this if you are layering multiple lines. Here, we will simply supply the mean of the population as the x-axis location where the line is drawn.

#### Plot Population Data ####
hist(population,
     main = "Histogram of Population IQ",
     xlab = "IQ Score",
     col = "steelblue")

#### Draw Abline ####
abline(v=mean(population),
       col="red",
       lwd=4)

Now that we have the power of the tidyverse at our hands, we can recreate this plot and customize it to our liking. Since population is a vector, we will first transform it into a data frame before plotting, since ggplot expects a data frame as its input.

library(tidyverse)

data.frame(population) %>% 
  ggplot(aes(x=population))+
  geom_histogram(color = "black",
                 fill = "steelblue")+
  labs(x="IQ Scores",
       y="Count",
       title = "Population IQ Scores (n = 1,000,000)")+
  geom_vline(aes(xintercept = mean(population)),
             color = "red",
             linewidth = 2)+
  theme_bw()

As expected, the line is located at exactly 100 along the x-axis. The bins may look slightly different, but that’s just because base R and ggplot use different binning defaults (ggplot uses 30 bins unless told otherwise, and will message you about it). The overall shape is still much the same.
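If you want more control over the binning, geom_histogram accepts a bins (or binwidth) argument; here is a minimal sketch (the choice of 50 bins is arbitrary):

#### Adjust Bin Count (Optional) ####
data.frame(population) %>% 
  ggplot(aes(x = population)) +
  geom_histogram(bins = 50, # overrides ggplot's default of 30
                 color = "black",
                 fill = "steelblue")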

Remember that we can also sample from data randomly with the sample function, and that these samples will vary from the true population mean and standard deviation. To illustrate that, we can run 10 random samples of this population data as shown in previous tutorials; however, you will notice below that this takes a lot of code.

#### Create Samples ####
sample.1 <- sample(population, 
                   size = 50,
                   replace = T)

sample.2 <- sample(population, 
                   size = 50,
                   replace = T)

sample.3 <- sample(population, 
                   size = 50,
                   replace = T)

sample.4 <- sample(population, 
                   size = 50,
                   replace = T)

sample.5 <- sample(population, 
                   size = 50,
                   replace = T)

sample.6 <- sample(population, 
                   size = 50,
                   replace = T)

sample.7 <- sample(population, 
                   size = 50,
                   replace = T)

sample.8 <- sample(population, 
                   size = 50,
                   replace = T)

sample.9 <- sample(population, 
                   size = 50,
                   replace = T)

sample.10 <- sample(population, 
                   size = 50,
                   replace = T)

This is a pain in the butt to type and just makes your R script obnoxiously long. I will make a quick detour for now about why you shouldn’t do this.

Efficient Coding

There are two themes I will highlight here: making your code more readable and making your code simpler. First, if your code has many functions nested within functions, like this:

data.frame(sample(rnorm(n=length(x),mean=mean(x),sd=sd(x))))

…it looks terrible and is difficult to read. Rather than trying to read the equivalent of a code salad, my personal preference is to put each new argument of a function on its own line (when useful). Basically, if you have a very verbose function within a function, open the parentheses across lines so it is easier to read. For example, you can start a data.frame call like so:

data.frame(x = rnorm(n=1000))

Or like this:

data.frame(
  x = rnorm(n=1000)
)

The second version can be created automatically by hitting Enter after the first parenthesis. Thereafter, we include only one argument per line in the “body” of the call. We can now clearly see that data.frame is the main function and rnorm is a sub-level function within it. This makes things easier to read line by line if we have something like this:

data.frame(
  x = rnorm(n=1000,
            mean=5,
            sd=2),
  y = rnorm(n=1000,
            mean=10,
            sd=20),
  z = rnorm(n=1000,
            mean=700,
            sd=200)
)

Here we can clearly see that x, y, and z are just three variables created with the rnorm function, with each argument of that function easy to read because the arguments sit on separate lines.

Anyway, the second point I want to make is that you should keep your code as simple and concise as possible. For example, why write the ten-piece code from earlier when we can collapse it all into one call? If there is ever a way to cut down typing time, we should take it.

One way we can quickly automate this process is with the replicate function, which repeats a given expression a set number of times; here it will re-run our sampling code 10 times.
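In general terms, replicate takes a count n and an expression, and evaluates that expression n times. Here is a toy sketch (the mean(rnorm(5)) expression is just an illustration):

#### Structure of replicate ####
replicate(n = 3,          # how many times to repeat
          mean(rnorm(5))) # the expression evaluated on each repeat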

In the code below, I have also opened the data.frame parentheses so the function is easier to read. We save the replicated samples into an object called df. You can inspect the head and see that each sample has been given a generic name of “X” plus a number. To make sure you have the right placement of parentheses, you can always click just to the right of a parenthesis in RStudio and it will highlight the one it matches.

df <- data.frame(
  replicate(
    n = 10, 
    sample(
      population,
      size = 50,
      replace = T
      )
    )
)

head(df)
##         X1        X2        X3        X4        X5        X6        X7
## 1 105.8404  96.38619  94.52429  98.49531 114.67532  80.27814 119.05657
## 2 102.2202  93.80960 109.60829  96.18276  94.00549 101.94919  84.51748
## 3 120.8653 107.82926  91.30786 131.33691  76.81899 102.52536  85.09278
## 4 115.1366  97.66699 139.40884  81.80746 120.09119  93.40466  99.08544
## 5 110.1159  67.34482 117.97518 103.67674  79.92993 106.48351 115.85482
## 6 128.6480 105.12500  94.27927  71.53071  87.02698 119.58307 116.23702
##          X8        X9       X10
## 1 101.51501  99.60230  97.28090
## 2  80.05765  96.15028 120.82124
## 3 109.93261 118.96527 114.56012
## 4  75.76042 119.11347  90.69073
## 5  94.40234  92.75085  97.48635
## 6 129.59019 111.29559  73.20283

Now we can make a histogram of each sample we created. For example, if we just want Sample 1 (labeled X1), we simply call that column in ggplot.

df %>% 
  ggplot(aes(x=X1))+
  geom_histogram(color = "black",
                 fill = "steelblue")+
  labs(x="IQ Scores",
       y="Count",
       title = "Sample 1 IQ Scores (n = 50)")+
  geom_vline(aes(xintercept=mean(X1)),
             color = "red",
             linewidth = 2)+
  theme_bw()

This is nothing new, as we explored this in Tutorial 3. The sample mean \(\bar{x}\) is slightly different from the population mean \(\mu\). But how does it compare to the 9 other samples we drew? Surely there is variation among the samples. We will use pivot_longer again to reshape the data into one column for the sample name and one for its values. We can then inspect the values to see if there is any variation.

#### Get By-Sample Mean/SD ####
grouped.df <- df %>% 
  pivot_longer(cols = contains("X"),
               names_to = "Sample",
               values_to = "Value") %>% 
  group_by(Sample) %>% 
  mutate(Mean = mean(Value),
         SD = sd(Value))

#### Inspect ####
grouped.df
## # A tibble: 500 × 4
## # Groups:   Sample [10]
##    Sample Value  Mean    SD
##    <chr>  <dbl> <dbl> <dbl>
##  1 X1     106.   98.5  15.3
##  2 X2      96.4 100.   12.6
##  3 X3      94.5  99.3  16.4
##  4 X4      98.5 103.   17.2
##  5 X5     115.  100.   13.7
##  6 X6      80.3  99.6  15.3
##  7 X7     119.  101.   13.5
##  8 X8     102.   95.2  14.2
##  9 X9      99.6 103.   13.5
## 10 X10     97.3 100.   13.9
## # … with 490 more rows

Just from this table, it seems there is definitely variation in both the means and the standard deviations. What if we simply obtained the mean and standard deviation without grouping by sample first?

df %>% 
  pivot_longer(cols = contains("X"),
             names_to = "Sample",
             values_to = "Value") %>% 
  summarise(Mean_Value = mean(Value),
            SD_Value = sd(Value))
## # A tibble: 1 × 2
##   Mean_Value SD_Value
##        <dbl>    <dbl>
## 1       100.     14.6

You can see they are almost exactly the same as the population values! To show how the sample values deviate from the population values, let’s create a faceted ggplot like the one we used in our last tutorial, as shown below.

Remember to put the xintercept argument inside the aes function, since we are telling the plot to map data to the line.

grouped.df %>% 
  ggplot(aes(x=Value))+
  geom_histogram(color = "black",
                 fill = "steelblue")+
  labs(x="IQ Scores",
       y="Count",
       title = "Sample IQ Scores")+
  theme_bw()+
  facet_wrap(~Sample,
             nrow = 5,
             ncol = 2)+
  geom_vline(aes(xintercept = Mean),
             color = "red",
             linewidth = 1)+
  geom_vline(aes(xintercept = 100),
             color = "orange",
             linewidth = 1)

You can see that the orange line is always at 100. However, two things are immediately apparent. First, the sample mean almost always deviates from the population mean. Second, this deviation is never very large. If our samples were not very representative, this would cause major issues: the distance between the red and orange lines would be large, and it would be difficult to make any inferences from the samples, as they wouldn’t generalize well to the population they were drawn from.

This natural deviation of sample means around the population mean is quantified by the standard error. Its formula is shown below:

\[ \text{SE} = \frac{\sigma}{\sqrt{n}} \]

where \(\sigma\) is the standard deviation (here estimated by the sample’s standard deviation) and \(n\) is the sample size. We can calculate this directly for one of our samples by selecting a sample (here X1) and creating a summarized variable called SE with the above formula.

se <- df %>% 
  select(X1) %>% 
  summarise(SE = sd(X1)/(sqrt(50)))
se
##         SE
## 1 2.168775

Looking at our plots above and the table of per-sample means, this number seems about right: the samples’ typical deviation from the population mean is small, around 2.17. We can also recover the standard deviation from the standard error with the following formula:

\[ \sigma = \sqrt{n}\times\text{SE} \]

#### Obtain Standard Deviation From SE ####
n <- 50
sd <- sqrt(n)*se
sd
##         SE
## 1 15.33555

We can see that this value isn’t perfect, but given we only sampled one group of 50 participants, it is pretty close to the population \(\sigma\). With more values in a sample, this number would approach 15.
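You can check this yourself by recovering the standard deviation from larger samples. Your exact numbers will differ with each random draw, but they should generally land closer to 15 as the sample size grows:

#### Larger Samples Approach Sigma ####
sd(sample(population, size = 500, replace = T))   # usually near 15
sd(sample(population, size = 50000, replace = T)) # typically closer still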

This is essentially the basis for our z-tests, which we explore in the next section. While it is quite rare in actual practice, knowing the population mean and standard deviation gives us a general idea of what samples drawn from that population look like. Because these samples naturally fluctuate, and because some samples are less representative than others, testing how different a sample is from the population tells us how extreme that sample is, based on its mean and standard error.

Z-Test

Now that we know about z-scores from previous tutorials and standard error from this tutorial, we can learn about the z-test. Remember that z-scores are the number of standard deviations a raw score is from the mean, and they map directly onto percentiles. If a given raw score has a corresponding z-score of \(+3.00\), that score is quite high, as about 99.9% of scores in a normal distribution fall below it. A z-score of \(-1.00\) means that only around 16% of the distribution lies below that score. A sample distribution and its percentiles can be seen below. Here we plot the cumulative distribution function (CDF): the probability that a random variable \(X\) will take a value less than or equal to a given value \(x\). I use z-scores below to demonstrate that the scale of the data does not matter in obtaining the CDF.

#### Transform Raw Scores to Z and Percentile ####
z.to.p <- grouped.df %>% 
  mutate(Scale_X = scale(Value),
         Prob_X = pnorm(Scale_X)) 

#### Plot Them ####
z.to.p %>%  
  ggplot(aes(x=Scale_X,
             y=Prob_X*100))+
  geom_line(color = "green3",
            linewidth = 3)+
  geom_point(size = 2,
             color = "darkgreen")+
  labs(x="Z-Score",
       y="Percentile",
       title = "Cumulative Density Function (Mu = 0, Sigma = 1)")+
  theme_bw()+
  scale_y_continuous(n.breaks = 30)+
  scale_x_continuous(n.breaks = 10)

As z-scores move toward the far right of the plot, the percentiles become very high, and the opposite is true for negative z-scores. The middle of the distribution lies at a z-score of 0, which sits at approximately the 50th percentile. In other words, the higher a z-score, the smaller the proportion of scores above it, and the lower a z-score, the smaller the proportion below it; a z-score simply measures how extreme, and therefore how improbable, a value is.
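We can verify these percentile claims directly with pnorm, which returns the standard normal CDF at any z-score:

#### Verify Percentiles With pnorm ####
pnorm(3)  # proportion of scores below z = +3.00
## [1] 0.9986501
pnorm(-1) # proportion of scores below z = -1.00
## [1] 0.1586553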

Knowing z-scores can consequently tell you how “extreme” a score is. If we get a score that has a low probability of occurring, we can probably guess that it isn’t typical. Additionally, if we know both a population mean and a sample mean, along with the corresponding standard error, we can test whether or not a sample is extreme compared to what the population says it should be. This is where a z-test, and most other inferential tests for that matter, come in quite handy.

The formal equation for a z-test is shown below:

\[ z = \frac{\bar{x}-\mu}{\sigma_{\bar{x}}} \]

where \(\bar{x}\) is the sample mean and \(\sigma_{\bar{x}}\) is the standard error of the mean. If this is confusing, just remember that we already calculated the standard error earlier, and that is what gets plugged in here. As a reminder, here is the SE formula, with \(\sigma_{\bar{x}}\) representing SE:

\[ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \]

The \(z\) value obtained from this formula is a bit different from the z-scores we calculated in earlier tutorials, as it is a test statistic that we will compare against a critical z-score. However, the idea is essentially the same, in that it represents a standardized score based on centering around the mean.

A practical scenario may be helpful. Let’s say we are running an experiment to test whether a new medication is safe for reducing hypertension. We need to draw up two competing hypotheses for our z-test. One is the null hypothesis, which stipulates that there is no difference between our observed scores and the population scores. The alternative hypothesis states the contrary: there is a difference between the observed and population scores. Our z-test evaluates only the null hypothesis; we set up a test that says, in effect, “we should find no difference between these scores.” If the z-score ends up so extreme that its probability of occurring under the null is low, we reject the null hypothesis. Importantly, this does not mean we accept the alternative hypothesis, only that the data give us insufficient reason to retain the null.

With this in mind, let’s say we know the average systolic blood pressure of the world population is supposed to be around 120 mmHg (I’m just guessing here, so don’t take this number too seriously). We then recruit a sample of 71 participants, administer the drug in a treatment trial, and check their blood pressure afterwards. Their mean blood pressure is around 137 mmHg, with a standard deviation of about 40.

From this information, we know that \(\mu = 120\), \(\bar{x} = 137\), \(\sigma = 40\), and \(n = 71\).

Instead of simulating a bunch of data right away, we can just test these values by plugging them into our formulas from before. First, let’s save all of this information as objects in R.

#### Calculate Stats ####
sigma <- 40
n <- 71
mu <- 120
m <- 137
se <- sigma/sqrt(n)
se
## [1] 4.747127

Then we just derive z with our previous formula.

#### Derive Z Value ####
z <- (m-mu)/se
z
## [1] 3.581114

Finally, we have two options. The first is to use a z-table, such as the one at this link, to find the percentile associated with our z-score. Alternatively, we can just use the pnorm function to get the associated percentile, then subtract it from 1 to get the p value, or the probability that a score would exceed this value.

#### Derive Associated P Value ####
p <- 1-pnorm(z)
p
## [1] 0.0001710664

The value is extremely low…so low that we can conclude our sample mean is very high compared to what we would expect if the sample and population means were equal. If the null hypothesis were true, the probability of drawing a sample mean this extreme would be less than .0002 (about 0.02 percent). Given our z-score is also positive, this means blood pressure in the sample is dangerously high…this medication may not be safe for human consumption. We would reject the null hypothesis (we cannot say there is no difference between the population mean and the sample mean, given its extremity) and would discontinue this drug.

However, that was a lot of work for a z-test. What if we created our own function to make the process quicker? We can do this! Recall that a function simply takes the inputs we assign and runs them through the steps we define. This is where saving objects becomes more helpful: objects created inside the function store intermediate values that later steps can use. The cat function prints a message by combining text strings (such as “z value”) with objects (which get printed next to those strings).

z.test <- function(sigma,n,mu,m){
  se <- sigma/sqrt(n)  # standard error of the mean
  z <- (m-mu)/se       # z statistic
  p <- 1-pnorm(z)      # one-tailed p value
  cat("z value:", z, "p value:", p)
}

Then we just plug in our values from before.

z.test(sigma = 40,
       n = 71,
       mu = 120,
       m = 137)
## z value: 3.581114 p value: 0.0001710664

Alternatively, we can pass our previously saved objects as the inputs.

z.test(sigma = sigma,
       n = n,
       mu = mu,
       m = m)
## z value: 3.581114 p value: 0.0001710664

Since R matches unnamed arguments by position, and our objects are supplied in the right order, we can even include our inputs without naming the arguments.

z.test(sigma,
       n,
       mu,
       m)
## z value: 3.581114 p value: 0.0001710664

Either way, we now have our custom function that makes the process simpler.
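One caveat: our function returns a one-tailed p value, as 1-pnorm(z) only measures the probability of exceeding z in the positive direction. If you wanted a two-tailed version (which matches the 1.96 cutoff discussed next), a small tweak works; this is just a sketch, with z.test2 as an illustrative name:

#### Two-Tailed Variant (Sketch) ####
z.test2 <- function(sigma,n,mu,m){
  se <- sigma/sqrt(n)
  z <- (m-mu)/se
  p <- 2*(1-pnorm(abs(z))) # count both tails of the distribution
  cat("z value:", z, "p value:", p)
}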

Now that we have covered a lot of ground, it may help to visualize all of this. First, let’s visualize what a statistically significant z-score is. A z-score of about 1.96 or beyond is usually considered statistically significant, as its associated values lie at the two extreme ends of a normal distribution. To highlight this, we can plot what this looks like using the nhstplot package. First we need to install it, as always.

install.packages("nhstplot")

After loading the library, we can call the plotztest function, which allows you to input any z-score you please and see where its associated rejection regions lie. For our 1.96 cutoff criterion, the plot will look like this:

library(nhstplot)
plotztest(z = 1.96)

This plot shows a normal distribution with two highlighted regions. If a z-score is lower than -1.96 or higher than 1.96, it falls within one of the red regions, indicating it is much lower or higher than typical scores. So if our z-score is exactly 1.96, only 2.5% of values in the distribution lie above it, indicating its extremity. Many inferential tests use this criterion: the two red tails added together make up .05, or 5%, of the distribution, which is the typical alpha level used to test statistical significance.
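We can confirm this cutoff ourselves with qnorm, the inverse of pnorm:

#### Verify the 1.96 Cutoff ####
qnorm(.975)       # critical z for a two-tailed alpha of .05
## [1] 1.959964
2*(1-pnorm(1.96)) # total area in both tails
## [1] 0.04999579

But what of the z-score that we obtained?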

plotztest(z = 3.581114)

You can see that the red regions are almost nonexistent now. This is because, as we discussed before, well under 0.1 percent of sample means would be this extreme if the null were true. So if the participants in our study have a mean blood pressure this high, they may be in serious danger if they keep taking the medication from our study.

Conclusion

This was a lot, so I recommend going through all of this again to make sure you understand the content. Having a good understanding of z-tests will help a lot with understanding other inferential tests, particularly t-tests, which happen to be the subject of the next tutorial. Stay tuned!

My Other RPubs

Thank you for reading this tutorial. If you felt that any part of this tutorial was helpful, and you would like to support me, please consider buying me a cup of coffee.

Check out my other RPubs if you want to learn more about stats in R.

As mentioned previously, we will likely cover t-tests next time, tests that many people are accustomed to using and may find more exciting to learn. Take care until then and happy coding!