Stack Overflow Developer Survey 2020
Each year, Stack Overflow surveys its users, who are primarily
software developers, about themselves, how they use Stack Overflow,
their work, and the development tools they use. In this analysis we’ll
look at a subset of the survey responses, from users who identified as
Data Scientists.
# Load libraries
library(tidyr)
library(dplyr)
library(ggplot2)
library(fst)
library(tibble)
library(readr)
stack_overflow <- read_rds(ruta_stckov)
stack_overflow
colnames(stack_overflow)
[1] "respondent" "age_first_code_cut" "converted_comp" "job_sat" "purple_link" "age_cat" "age"
[8] "hobbyist"
glimpse(stack_overflow)
Rows: 2,261
Columns: 8
$ respondent <dbl> 36, 47, 69, 125, 147, 152, 166, 170, 187, 196, 221, 235, 262, 265, 271, 277, 322, 349, 414, 448, 518, 559, 563, 587, 607, 637, 653, 6…
$ age_first_code_cut <chr> "adult", "child", "child", "adult", "adult", "adult", "child", "adult", "child", "adult", "adult", "adult", "child", "adult", "adult"…
$ converted_comp <dbl> 77556, 74970, 594539, 2000000, 37816, 121980, 48644, 130000, 94539, 1920000, 48644, 54049, 171000, 125000, 85000, 165000, 4188, 13000…
$ job_sat <fct> Slightly satisfied, Very satisfied, Very satisfied, Very satisfied, Very satisfied, Very satisfied, Neither, Very satisfied, Slightly…
$ purple_link <chr> "Hello, old friend", "Hello, old friend", "Hello, old friend", "Amused", "Hello, old friend", "Hello, old friend", "Hello, old friend…
$ age_cat <chr> "At least 30", "At least 30", "Under 30", "At least 30", "Under 30", "Under 30", "Under 30", "Under 30", "At least 30", "Under 30", "…
$ age <dbl> 34, 53, 25, 41, 28, 30, 28, 26, 43, 23, 24, 35, 37, 51, 22, 43, 24, 27, 24, 41, 27, 37, 29, 28, 25, 46, 36, 31, 28, 31, 21, 54, 32, 3…
$ hobbyist <chr> "Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "Y…
Hypothesizing about the mean
Let’s hypothesize that the mean annual compensation of the population
of data scientists is $110000. The first thing we can do to check this
is to examine the sample data in the survey. Annual compensation,
converted to dollars, is included in the converted_comp column. The
sample mean is a type of point estimate, which is another name for a
summary statistic. You can calculate it with base-R, or more verbosely
with dplyr. Either way, the answer is $122000. That’s different from our
hypothesis, but is it meaningfully different?
mean_comp_samp <- mean(stack_overflow$converted_comp)
mean_comp_samp
[1] 119574.7
mean_comp_samp <- stack_overflow %>%
summarise(mean_compensation = mean(converted_comp)) %>%
pull(mean_compensation)
mean_comp_samp
[1] 119574.7
Generating a bootstrap distribution
To answer this, we need to generate a bootstrap distribution of
sample means. This is done by resampling the dataset, calculating the
sample mean for that resample, then repeating those steps.
# Repeat steps 1 & 2 many times
so_boot_distn <- replicate(
n = 5000,
expr = {
#Step 1. Resample
stack_overflow %>%
slice_sample(prop = 1, replace = TRUE) %>%
# Step 2. Calculate point estimate
summarise(mean_compensation = mean(converted_comp)) %>%
pull(mean_compensation)
}
)
head(so_boot_distn)
[1] 121172.8 122346.3 115986.9 120278.9 115449.6 118904.6
Visualizing the bootstrap distribution
Here’s a histogram of the bootstrap distribution. Its bell shape
means it’s roughly normally distributed. Notice that \(110000\) is on the left of the
distribution.
tibble(resample_mean = so_boot_distn) %>%
ggplot(aes(resample_mean)) +
geom_histogram(binwidth = 1000)

Standard Error
Recall that the standard deviation of the sample statistics in the
bootstrap distribution estimates the standard error of the
statistic.
std_error <- sd(so_boot_distn)
std_error
[1] 5508.39
Z-scores
Since variables have arbitrary units and ranges, before we test our
hypothesis, we need to standardize the values. A common way of
standardizing values is to subtract the mean, and
divide by the standard deviation. For hypothesis
testing, we use a variation where we take the sample statistic,
subtract the hypothesized parameter value, and divide by the standard
deviation of the statistic, which is the standard
error. The result is called a z-score. The
sample mean annual compensation for data scientists of $122000 , minus
the hypothesized compensation for data scientists of $110000, divided by
the bootstrap standard error of $5590 gives a z-score 1.71
mean_comp_hyp <- 110000
mean_comp_samp
[1] 119574.7
z_score <- (mean_comp_samp - mean_comp_hyp) / std_error
z_score
[1] 1.738206
Testing the hypothesis
Is that a big or small number? Determining that is the goal of this
course. In particular, we can now state one of the uses of hypothesis
testing. Determining whether a sample statistic is close to or far away
from an expected value.
Standard normal (z) distribution
Here’s a density plot of the probability density function for the
standard normal distribution. That is, a normal distribution with mean
zero and standard deviation one. It’s often called the z-distribution.
As you might guess, z-scores are related to the z-distribution.
# Create a dataset of z-distribution and plot it
tibble(x = seq(-4, 4, 0.01)) %>%
ggplot(aes(x)) +
stat_function(fun = dnorm) +
ylab("PDF(x)")

P-values
Hypothesis tests are like criminal trials.
The true situation is that the defendant committed the crime, or they
didn’t. There are also two possible outcomes. The judge gives a guilty
or a not guilty verdict. The initial assumption is that the defendant is
not guilty. It’s up to the prosecution team to come up with evidence
beyond a reasonable doubt that the defendant committed the crime in
order for a guilty verdict to be given.
Age of first programming experience
Let’s return to our survey. The age_first_code_cut
variable classifies when the user began programming. If they were 14 or
older, they are classified as “adult”; otherwise
“child”.
Suppose previous research suggests that 35% of software developers
programmed as children. This raises a question answerable with our
dataset.
Does our sample provide evidence that data scientists have a
greater proportion starting programming as a child?
Definitions
Hypothesis is a statement about a population
parameter. We don’t know the true value of this population
parameter; we can only make inferences about it from the data.
Hypothesis tests compare two competing hypotheses.
These two hypotheses are the null hypothesis,
representing the existing idea, and the alternative
hypothesis, representing a new idea.
They are denoted \(H_{0}\) and \(H_{A}\) respectively. The null
hypothesis is like the current champion, and the alternative hypothesis
is like a challenger trying to overthrow that champion. Here,
the null hypothesis is that our data won’t tell us anything new, and
that the proportion of data scientists starting programming as children
follows the previous research on software developers, at 35%. The
alternative hypothesis is that our hunch is correct, and that the
percentage is greater than 35%.
Criminal trials vs. Hypothesis Testing
Let’s compare the criminal trial with the hypothesis test. The
defendant committing the crime is equivalent to the alternative
hypothesis being true, and the defendant not committing the
crime is equivalent to the null hypothesis being true.
Rather than saying we accept the alternative hypothesis, the
verdicts are rejecting the null hypothesis, or failing
to reject the null hypothesis. Initially, we assume that the
null hypothesis is true, and this only changes if the sample provides
enough evidence to reject it.
The hypothesis testing equivalent of “beyond a reasonable doubt” is
known as the significance level.
One-tailed & two-tailed tests
The tails of a distribution are the left and right edges of its PDF.
Hypothesis tests determine whether the sample statistics lie in
the tails of the distribution. There are three types of tests,
and the phrasing of the alternative hypothesis determines which type you
should use. In this case, we need a right-tailed
test.
P-values
p-values measure the strength of support for the
null hypothesis. Small p-values mean our statistic is producing
an unlikely result in a tail of our null distribution.
p-values are probabilities, so they are always
between zero and one.
Defining p-values
The definition of a p-value has four parts. It’s a probability
related to where the statistic from our sample lies on the null
distribution. It measures how many values fall farther away than
what we observed in the direction of the alternative
hypothesis. It is based on the null distribution, which assumes
that the null hypothesis is true. The p-value measures
evidence. Does it favor the original assumption or the new
alternative?
Calculating the z-score
To calculate the p-value, we need to run through the
same steps as before. We get the sample statistic, in
this case the proportion of data scientists who started
programming as children. We get the hypothesized value from the
null hypothesis, \(35\%\). We get the
standard error from the bootstrap distribution. The
z-score is the difference between the proportions,
divided by the standard error.
prop_child_samp <- stack_overflow %>%
summarise(point_estimate = mean(age_first_code_cut == "child")) %>%
pull(point_estimate)
prop_child_samp
[1] 0.3914197
sd_boot_distn <- replicate(
n = 5000,
expr = {
#Step 1. Resample
stack_overflow %>%
slice_sample(prop = 1, replace = TRUE) %>%
# Step 2. Calculate point estimate
summarise(sample_mean = mean(age_first_code_cut == "child")) %>%
pull(sample_mean)
}
)
head(sd_boot_distn)
[1] 0.3958425 0.3918620 0.4073419 0.3808050 0.3927466 0.3900929
std_err = sd(sd_boot_distn)
std_err
[1] 0.01026421
prop_child_hyp <- 0.35
#std_err <- 0.0096028
z_score <- (prop_child_samp - prop_child_hyp) / std_err
z_score
[1] 4.035356
Calculating the p-value
The last step is new. We pass the z-score to the normal cumulative
distribution function, pnorm. Since we want the right-hand tail for this
test, we set lower-dot-tail to FALSE. The p-value is 8.04 e -10 That’s a
very small number, but is it small enough to reject the null
hypothesis?
p_value <- pnorm(z_score, lower.tail = FALSE)
p_value
[1] 2.725982e-05
Statistical significance
p-values quantify how much evidence there is for the
null hypothesis. Large p-values lead us to not have enough
evidence for the alternative hypothesis, sticking with the
assumed null hypothesis instead. Small p-values make us doubt
this original assumption in favor of the alternative
hypothesis. What defines the cutoff point between a small
p-value and a large one?
Significance level
The cutoff point is known as the significance level,
and is denoted alpha (α). The appropriate significance
level depends on the dataset and the discipline you are working in.
5% is the most common choice, but 10%
and 1% are also popular. The significance level gives
us a decision process for which hypothesis to support. If the
p-value is less than alpha, we reject the null
hypothesis. Otherwise we fail to reject it.
It’s important that you decide what the appropriate significance
level should be before you run your test. Otherwise there is a
temptation to decide on a significance level that lets you choose the
hypothesis you want.
Calculating the p-value
The workflow is now to set the significance level, in this case 0.05.
Then calculate the z-score and use the standard normal CDF to get the
p-value. In this case, the p-value of 8 e-06 is less
than or equal to point-zero-five, so we reject the null hypothesis. We
have strong evidence for the alternative hypothesis that the proportion
of data scientists starting programming as children is greater than
35%.
alpha <- 0.05
prop_child_samp <- stack_overflow %>%
summarise(point_estimate = mean(age_first_code_cut == "child")) %>%
pull(point_estimate)
prop_child_hyp <- 0.35
std_error <- 0.0096028
z_score <- (prop_child_samp - prop_child_hyp) / std_error
p_value = pnorm(z_score, lower.tail = FALSE)
p_value
[1] 8.041901e-06
p_value <= alpha
[1] TRUE
\[
p-value \ is \ less \ than \ or \ equal \ to \ \alpha, so \ reject \
H_{0} \ and \ accept \ H_{A}
\]
Confidence intervals
To get a sense as to potential values of the population parameter,
it’s common to choose a confidence interval of one minus the
significance level. For a significance level of 0.05,
we’d use a 95% confidence interval. Here’s the calculation using the
quantile method. The interval runs from 0.37 to
0.41 giving a range of plausible values for the
proportion of data scientists starting programming as children.
first_code_boot_samp <- replicate(
n = 5000,
expr = {
stack_overflow %>%
slice_sample(prop = 0.25, replace = TRUE) %>%
summarise(point_estimate = mean(age_first_code_cut == "child")) %>%
pull(point_estimate)
}
)
first_code_boot_distn <- tibble(first_code_child_rate = first_code_boot_samp)
head(first_code_boot_distn)
conf_int <- first_code_boot_distn %>%
summarise(lower = quantile(first_code_child_rate, 0.025),
upper = quantile(first_code_child_rate, 0.975))
conf_int
Types of errors
Returning to the criminal trial analogy, there are two possible truth
states and two possible outcomes, for four combinations in total. Two of
these indicate that the verdict was correct. If a defendant didn’t
commit the crime, but the verdict was guilty, they are wrongfully
convicted. If the defendant committed the crime but the verdict was not
guilty, they got away with it. These are both errors in justice.
Similarly, for hypothesis testing, there are two ways to get it right,
and two types of error. If you support the alternative hypothesis when
the null hypothesis was correct, you made a false positive error. If you
support the null hypothesis when the alternative hypothesis was correct,
you made a false negative error. These errors are sometimes known as
type one and type two errors.
Choosen H0 |
correct |
false negative |
Choosen HA |
false positive |
correct |
Possible errors in our example
\[
if \ p \leqslant \alpha, we \ reject \ \ H_{0}
\]
A false positive (Type I) error could have ocurred: We thought that
data scientists starting coding as children at a higher rate when in
reality they did not
\[
if \ p > \alpha, \ we \ fail \ to \ reject \ H_{0}
\]
A false negative (Type II) error could have ocurred: we thought that
data scientists coded as children at the same rate as software engineers
when in reality they coded as children at a higher rate.
Hypotheses
The null hypothesis is that the sample mean for the two
groups is the same, and the alternative hypothesis is that
the sample statistic for users who started coding as children is
greater than for users who started coding as adults. We can
write these hypotheses using equations. Mu (µ)
represents an unknown population mean, and we use subscripts to denote
which group the sample mean belongs to. An alternate way of writing the
equations is to compare the differences in sample means to zero.
Calculating group wise summary statistics
Calculating the summary statistics for each group is straightforward.
You start with the sample, group by the categorical variable, then
summarize the numeric variable. A dplyr
way of doing this
is shown, but there are many possibilities. Pause the video for a moment
and think about different ways of solving this. Here, the child
programmers have a mean compensation of $132419.6 dollars compared to
$111313.3 for adult programmers. Is that increase statistically
significant or could it be explained by sampling variability?
stack_overflow %>%
group_by(age_first_code_cut) %>%
summarise(mean_compensation = mean(converted_comp))
Test statistics
Although we don’t know the population mean, we estimate it
using the sample mean. x-bar is used to denote
a sample mean. Then we use subscripts to denote which group a sample
mean corresponds to. The difference between these two sample
means is called the test statistic for the hypothesis test. The
z-scores are another type of test statistic.
Standardizing the test statistic
z-scores are calculated by taking the sample statistic, subtracting
the mean of this statistic as the population parameter of interest, then
dividing by the standard error. In the two sample case, the test
statistic, denoted t, uses a similar equation. You take the difference
between the sample statistics for the two groups, subtract the
hypothesized difference between the two groups, then divide by the
standard error. \[
x =
\frac{sample \ stat + population \ parameter}{standard \ error}
\]
\[
t = \frac{difference \ in \ sample \ stats \ - difference \ in \
population \ parameters}{standard \ error}
\]
\[
t = \frac{(\bar{x}_{child} \ - \bar{x}_{adult}) \ - (\mu_{child} \ -
\mu_{adult})}{SE(\bar{x}_{child} \ - \bar{x}_{adult})}
\]
Standard Error
To calculate the standard error, needed for the denominator of the
test statistic equation, bootstrapping is most accurate. However, there
is an easier way to approximate it. You calculate the standard deviation
of the numeric variable for each group in the sample, and the number of
observations for each group. Then it’s just some arithmetic.
Assuming the null hypothesis is true
Here’s the test statistic equation again. If we assume that the
null hypothesis is true, there’s a simplification we
can make. The null hypothesis assumes that the population means
are equal. That makes one term in the numerator disappear.
Using the approximation for the standard error, we now have a way of
calculating the test statistic using only calculations on the sample
dataset. We need the mean, standard deviation, and number of
observations for each group. This is another group-by and summarize
combination. \[
SE(\bar{x}_{child} \ - \bar{x}_{adult}) \approx
\sqrt{\frac{s^2_{child}}{n_{child}} \ + \frac{s^2_{adult}}{n_{adult}}}
\] \[
H_{0}: \mu_{child} \ - \mu_{adult} = 0
\]
\[
t = \frac{(\bar{x}_{child} \ - \bar{x}_{adult})}{SE(\bar{x}_{child} \ -
\bar{x}_{adult})}
\]
\[
t = \frac{(\bar{x}_{child} \ -
\bar{x}_{adult})}{\sqrt{\frac{s^2_{child}}{n_{child}}} \ +
\frac{s^2_{adult}}{n_{adult}}}
\]
group_tests <- stack_overflow %>%
group_by(age_first_code_cut) %>%
summarise(x_bar = mean(converted_comp), s = sd(converted_comp), n = n())
# See the results
group_tests
Calculating the test statistic
Annoyingly, it requires some grim data manipulation code to calculate
the test statistic from this data frame. Since this isn’t a course on
data manipulation, let’s do the calculation with separate variables. The
numerator is a simple subtraction, and the denominator is like a
weighted hypotenuse. The t-statistic is 1.86. Just as
with z-scores, we can’t do anything with that number yet
\[
t = \frac{(\bar{x}_{child} \ -
\bar{x}_{adult})}{\sqrt{\frac{s^2_{child}}{n_{child}} \ +
\frac{s^2_{adult}}{n_{adult}}}}
\]
numerator <- group_tests$x_bar[2] - group_tests$x_bar[1]
denominator <- sqrt((group_tests$s[2]^2 / group_tests$n[2]) + (group_tests$s[1]^2 / group_tests$n[1]))
t_stat <- numerator / denominator
t_stat
[1] 1.869931
Hypothesis testing workflow
Identify population parameter that is hypothesized about
Specify the null & alternative hypothesis
Determine (standardized) test statistic & corresponding null
distribution
Conduct hypothesis test in R
Measure evidence against the null hypothesis
Make a decision comparing evidence to significance level
Interpret the results in the context of the original
problem
Calculating p-values from t-statistics
The test statistic, t, follows a t-distribution. t-distributions have
a parameter called the degrees of freedom, or
df for short. Here’s a line plot of the PDF of a
t-distribution with one degree of freedom in yellow, and the PDF of a
normal distribution in blue dashes. Notice that the t-distribution for
small degrees of freedom has fatter tails than the normal distribution,
but otherwise they look similar.
Degrees of freedom
As you increase the degrees of freedom, the t-distribution gets
closer to the normal distribution. In fact, a normal
distribution is a t-distribution with infinite degrees of
freedom. Degrees of freedom are defined as the maximum
number of logically independent values in the data sample.
That’s a fairly tricky concept, so let’s try an example.
Calculating degrees of freedom
Suppose your dataset has 5 independent observations. Four of the
values are 2, 6, 8, and 5. You also know the sample mean is 5. The last
value is no longer independent; it must be 4. Even though all five
observations in the sample were independent, because you know an
additional fact about it, you only have 4 degrees of freedom. In our two
sample case, there are as many degrees of freedom as observations, minus
two because we know two sample statistics.
Hypotheses
Recall the hypotheses for our Stack Overflow question about
compensation for the two age groups. Since this is a “greater than”
alternative hypothesis, we need a right-tailed test.
\(H_{0}:\) The mean compensation
(in USD) is the same for those that coded first as a child and those
that coded first as an adult.
\(H_{A}:\) The mean compensation
(in USD) is greater for those that coded first as a
child compared to those that coded first as an adult.
Use a right-tailed test
Significance level
We’re going to calculate a p-value in a moment. We need to decide on
a significance level before we do that. There are several possibilities;
I’m going to use 0.1
That means that if the p-value is less than point-one, we reject the
null hypothesis in favor of the alternative.
\[\alpha = 0.1\]
\[
if \ p \leqslant \alpha \ then \ reject \ H_{0}
\]
Calculating p-values: One proportion vs. a
value
p_value <- pnorm(z_score, loer.tail = FALSE)
Calculating p-values: Two means from different
groups
\[
t = \frac{(\bar{x}_{child} \ -
\bar{x}_{adult})}{\sqrt{\frac{s^2_{child}}{n_{child}} \ +
\frac{s^2_{adult}}{n_{adult}}}}
\]
Now we are calculating means rather than proportions, the z-score is
replaced with a t-test statistic. The calculation also needs the degrees
of freedom, which is the number of observations in both groups, minus
two.
In the previous steps, we used an approximation using sample
information (not bootstrapping) for the test statistic standard error. A
consequence of this is that to calculate the p-value, we need to
transform the test statistic using the t-distribution CDF instead of the
normal distribution CDF. Using this approximation adds more uncertainty
and that’s why this is a t instead of a z problem.
The t distribution allows for more uncertainty when using multiple
estimates in a single statistic calculation. Notice the use of
pt()
instead of pnorm()
, and that the df
argument is set to the degrees of freedom. This p-value is less than the
significance level of 0.1, we should reject the null
hypothesis in favor of the alternative hypothesis that Stack Overflow
data scientists who started coding as a child earn more.
numerator <- group_tests$x_bar[2] - group_tests$x_bar[1]
denominator <- sqrt((group_tests$s[2]^2 / group_tests$n[2]) + (group_tests$s[1]^2 / group_tests$n[1]))
t_stat <- numerator / denominator
t_stat
[1] 1.869931
degrees_of_freedom <- group_tests$n[2] + group_tests$n[1] - 2
degrees_of_freedom
[1] 2259
p_value <- pt(t_stat, df = degrees_of_freedom, lower.tail = FALSE)
p_value
[1] 0.0308113
Why is t needed?
The process for calculating p-values is to start with the sample
statistic, standardize it to get a test statistic, then transform it via
a cumulative distribution function. Previously, that final
transformation was denoted z
, and the CDF transformation
used the (standard normal) z-distribution. The test statistic was
denoted t
, and the transformation used the
t-distribution.
In which hypothesis testing scenario is a t-distribution needed
instead of the z-distribution?
Using a sample standard deviation to estimate the standard
error is computationally easier than using bootstrapping. However, to
correct for the approximation, you need to use a t-distribution when
transforming the test statistic to get the p-value.
One-sample proportion tests
The previous example tested a single proportion against a specific
value. As with means, we can also test differences between proportions
in two populations.
Recap
The previous hypothesis tests measured whether or not an
unknown population proportion was equal to some value. We used
a bootstrap distribution of the sample to estimate the standard
error of the sample statistic. The standard error was used to
calculate a standardized test statistic, which was used to get a
p-value, which was used to decide whether or not to reject the null
hypothesis. Bootstrap distributions can be computationally intensive to
calculate, so this time we’ll calculate the test statistic without
it.
Standardized test statistic for proportions
An unknown population parameter that is a proportion, or
population proportion for short, is denoted p
. The
sample proportion is denoted \(\hat{p}\), and the hypothesized value for
the population proportion is denoted \(p_{0}\) . As seen previously, the
standardized test statistic is a z-score. You calculate it by
starting with the sample statistic, subtracting its mean, then dividing
by its standard error.
\(\hat{p}\) minus the mean of \(\hat{p}\) , divided by the standard error
of \(\hat{p}\) .
Recall from Sampling in R, the mean of \(\hat{p}\) is \(p\). Under the null hypothesis, the unknown
proportion p is assumed to be the hypothesized population parameter
\(p_{0}\). The z-score is now \(\hat{p}\) minus \(p_{0}\), divided by the standard error of
\(\hat{p}\).
\(p\) : Sample proportion
\(\hat{p}\) : sample proportion
(sample statistic)
\(p_{0}\) : hypothesized population
proportion
\[ z = \frac{\hat{p} \ -
mean(\hat{p})}{standard \ error(\hat{p})} = \frac{\hat{p} \ -
p}{standard \ error(\hat{p})} \]
Assuming \(H_{0}\) is true. \(p = p_{0}\). so
\[ z = \frac{\hat{p} \ - p_{0}}{standard \
error(\hat{p})} \]
Easier standard error calculations
Here’s the approximate standard error for two sample means. For
proportions, under \(H_{0}\), the
standard error of \(\hat{p}\) is \(p_{0}\) times \(1
- p_{0}\) , divided by the number of observations, then
square-rooted. We can substitute this into our equation for the z-score.
This is easy to calculate because it only uses \(\hat{p}\) and \(n\), which we get from the sample, and
\(p_{0}\), which we chose.
\[ SE(\bar{x}_{child} \ - \bar{x}_{adult})
\approx \sqrt{\frac{s^2_{child}}{n_{child}} \ +
\frac{s^2_{adult}}{n_{adult}}} \]
\[ SE_{\hat{p}} \ = \sqrt{\frac{p_{0} \ *
(1 - p_{0})}{n}} \]
Assuming \(H_{0}\) is TRUE,
\[ z \ = \frac{\hat{p} \ -
p_{0}}{\sqrt{\frac{p_{0} \ * \ (1 \ - p_{0})}{n}}} \]
This only uses sample information (\(\hat{p}\) and \(n\)) and the hypothesized parameter (\(p_{0}\))
Why z instead of t?
Why we used a z-distribution here, but a t-distribution previously?.
This is the test statistic equation for the two sample mean
case. The standard deviation of the sample, s,
is calculated from the sample mean, \(\bar{x}\). That means that \(\bar{x}\) is used in the numerator to
estimate the population mean, and in the denominator to estimate the
population standard deviation. This dual usage causes an increase in our
uncertainty about the estimate of the population parameter. Since
t-distributions are effectively a normal distribution with fatter tails,
we can use them to account for this extra uncertainty. In effect, the
t-distribution provides extra caution against mistakenly rejecting the
null hypothesis. For proportions, we only use \(\hat{p}\) in the numerator, thus avoiding
the problem with uncertainty, and a z-distribution is fine.
\[ t \ = \frac{(\bar{x}_{child} \ -
\bar{x}_{adult})}{\sqrt{\frac{a^2_{child}}{n_{child}} \ +
\frac{s^2_{adult}}{n_{adult}}}} \]
\(s\) is calculated from \(\bar{x}\), so \(\bar{x}\) is used to estimate the
population mean and to estimate the population standard
deviation
This increases the uncertainty in our estimate of the population
parameter.
t-distribution has fatter tails than a normal
distribution
This gives an extra level of caution
\(\hat{p}\) only appears in the
numerator, so z-scores are fine
Stack overflow age categories
Let’s hypothesize that half the users are under thirty. We’ll set a
significance level of 0.01. Just over half the users are under
thirty.
\(H_{0}\) : The proportion of SO
users under thirty is equal to 0.5
\(H_{A}\) : The proportion of SO
users under thirty is not equal to 0.5
alpha <- 0.01
stack_overflow %>%
count(age_cat)
Variables for z
Let’s get the numbers needed for the z-score. \(\hat{p}\) is the proportion of rows in the
sample where age_cat equals under thirty. \(p_{0}\) is 0.5 according to the null
hypothesis. n is the number of rows in the dataset.
p_hat <- stack_overflow %>%
summarise(prop_under_30 = mean(age_cat == "Under 30")) %>%
pull(prop_under_30)
p_hat
[1] 0.5356037
p_0 <- 0.5
n <- nrow(stack_overflow)
n
[1] 2261
Calculating the z-score
Calculating the z-score is just arithmetic; the value is
three-point-five.
\[ z \ = \frac{\hat{p} \ -
p_{0}}{\sqrt{\frac{p_{0} \ * \ (1 \ - p_{0})}{n}}} \]
numerator <- p_hat - p_0
denominator <- sqrt((p_0 * (1 - p_0)) / n)
z_score <- numerator / denominator
z_score
[1] 3.385911
Calculating the p-value
For left-tailed alternative hypotheses, you transform the z-score
into a p-value using the pnorm with the default of lower-dot-tail equals
TRUE. For right-tailed alternative hypotheses you set lower-dot-tail to
FALSE. For two-tailed alternative hypotheses, you check whether the test
statistic lies in either tail, so the p-value is the sum of these two
values. Since the normal distribution PDF is symmetric, here this
simplifies to twice the left-tailed p-value. Here, the p-value is less
than the significance level of point-zero-one, so we reject the null
hypothesis and conclude that the proportion of users under thirty is not
equal to point-five.
# Left tailed ("less than")
p_value_less <- pnorm(z_score)
p_value_less
[1] 0.9996453
# Right tailed ("greater than")
p_value <- pnorm(z_score, lower.tail = FALSE)
p_value
[1] 0.0003547114
# Two-tailed ("not equal")
p_value_2t <- pnorm(z_score) + pnorm(z_score, lower.tail = FALSE)
p_value_2t
[1] 1
p_value_a <- 2 * pnorm(z_score)
p_value_a
[1] 1.999291
p_value <= alpha
[1] TRUE
Two sample proportion tests
The previous example tested a single proportion against a
specific value. As with means, we can also test differences
between proportions in two populations.
Comparing two proportions
The Stack Overflow survey contains a hobbyist variable. The value
“Yes” means the user described themself as a hobbyist; “No” means they
described themself as a professional. We can hypothesize that
the proportion of users who are hobbyists is the same for the under
thirty age category as the at least thirty category.
More formally, the null hypothesis is that the difference
between the population parameters for each group is zero. I’ve
set a significance level of 0.05.
\(H_{0}\) : The proportion of SO
users who are hobbyists is the same for those under thirty as those at
least thirty
\[
H_{0}: p_{\geq 30} \ - p_{< 30} \ = 0
\]
\(H_{A}\) : The proportion of SO
users who are hobbyists is different for those under thirty as those at
least thirty
\[
H_{A}: p_{\geq 30} \ - p_{<30} \neq 0
\]
Calculating the z-score
Let’s break down this z-score equation. The sample statistic is the
difference in the proportions for each category \((\hat{p}_{\geq30} \ - \hat{p}_{<30}) \ -
0\). That’s the two \(\hat{p}\)
values on the numerator. We subtract the hypothesized value of the
population parameter. Assuming the null hypothesis is true,
that’s just zero, so the term disappears. The
denominator is the standard error of the sample statistic \(SE(\hat{p}_{\geq30} \ -
\hat{p}_{<30})\).
Again we can avoid having to generate a bootstrap distribution. The
equation for the standard error is a slightly more complicated version
of the equation for the one sample case. In this equation, note that
p-hat is the sample proportion for the whole dataset, not for each
category. This whole dataset \(\hat{p}\) is known as a pooled
estimate of the population proportion. We need one more
equation to get \(\hat{p}\). It’s a
weighted mean of the p-hats for each category. This looks horrendous,
but it’s just arithmetic, and R is pretty good at that. The good news is
that we only need four numbers from the sample dataset to do these
calculations.
\[
z = \frac{(\hat{p}_{\geq30} \ - \hat{p}_{<30}) \ -
0}{SE(\hat{p}_{\geq30} \ - \hat{p}_{<30})}
\]
\[
SE(\hat{p}_{\geq0} \ - \hat{p}_{<0}) \ = \sqrt{\frac{\hat{p} \times \
(1 \ - \hat{p})}{n_{\geq30}} \ + \frac{\hat{p} \times \ (1 \ -
\hat{p})}{n_{<30}}}
\]
\(\hat{p}\) is a pooled estimate for
\(p\) (common unknown proportion of
successes)
\[
\hat{p} \ = \frac{n_{\geq30} \times \hat{p}_{\geq30} \ + n_{<30}
\times \hat{p}_{<30}}{n_{\geq30} \ + n_{<30}}
\]
We only need to calculate 4 numbers: \(\hat{p}_{\geq30}, \ \hat{p}_{<30}, \
n_{\geq30}, \ n_{<30}\)
Getting the numbers for the z-score
To calculate these four numbers you group_by()
the
categories, and summarize to calculate the sample proportion and row
counts. After that, you can do the arithmetic to get the test statistic
- the z-score is minus four-point-two - and call pnorm to get the
p-value. I’m not going to run through these steps because it’s exactly
the same as you’ve seen already. Instead, I’ll show you an easier
way.
stack_overflow %>%
group_by(age_cat) %>%
summarise(p_hat = mean(hobbyist == "Yes"), n = n())
Proportion tests using
prop_test()
Naturally, R has at least two functions for performing this
type of hypothesis test, known as a proportion
test. Base-R has a function named prop.test()
,
though unfortunately, its interface is somewhat peculiar. To
make your life easier it’s best to use prop_test()
from
infer. This takes a formula, with your numeric variable
on the left and the categories on the right. The order argument
specifies which \(\hat{p}\) should be
subtracted from the other. Here, it is specified so that you start with
the “at least thirty” \(\hat{p}_{\geq30}\) , then subtract the
“under thirty” \(\hat{p}_{<30}\)
The success
argument specifies which of the two
response levels to get the proportion of.
alternative
takes the same values as t-test. Here,
we have a two-sided alternative hypothesis. The
correct
argument specifies whether or not to apply
Yates’ continuity correction. This is a fudge factor
needed for technical reasons when the sample size is very
small. Since each group has over one thousand
observations, we don’t need it. The p-value is in the third
column of the resulting tibble. It’s smaller than the significance level
we specified earlier, so we can conclude that there is a
difference in the proportion of Stack Overflow hobbyists between the
under and over thirty groups.
library(infer)
stack_overflow %>%
prop_test(hobbyist ~ age_cat, order = c("At least 30", "Under 30"), success = "Yes",
alternative = "two-sided", correct = FALSE)
Chi-square test of independence
Just as ANOVA extends t-tests to more than two groups, chi-square
tests of independence extend proportion tests to more than two
groups.
Revisiting the proportion test
Here’s the proportion test from last time. In the first column, the
test statistic is the square of the z-score. Minus four-point-two-two
squared is seventeen-point-eight. In the second column, chisq_df means
the degrees of freedom of a chi-square test. Its value is one.
library(infer)
stack_overflow %>%
prop_test(hobbyist ~ age_cat, order = c("At least 30", "Under 30"), alternative = "two-sided", correct = FALSE)
Independence of variables
That proportion test had a positive result. There was evidence that
the hobbyist and age category variables had an association. If that
wasn’t the case, and the proportion of hobbyists was the same for each
age category, the variables would be considered statistically
independent. More formally, statistical independence of two
categorical variables is when the proportion of successes in the
response variable is the same across all categories of the explanatory
variable.
Job satisfaction and age category
Recall that the Stack Overflow sample has an age category variable
with two categories and a job satisfaction variable with five
categories.
stack_overflow %>%
count(age_cat)
stack_overflow %>%
count(job_sat)
Declaring the hypothesis
We can declare hypotheses to test for independence of these
variables. Here, age category is the response variable,
and job satisfaction is the explanatory variable.
The null hypothesis is that independence occurs. I’ve
set a significance level of 0.1. The test statistic is
denoted chi-square. It quantifies how far away the observed
results are from the values you’d expect if independence was
true.
\(H_{0}\) : Age categories are
independent of job satisfaction levels
\(H_{A}\) : Age categories are not
independent of job satisfaction levels
Test statistic denoted \(\chi^2\)
Assuming the independence, how far away are the observed results from
the expected values?
alpha = 0.1
Exploratory visualization: Proportional stacked bar
plot
Let’s explore the data using a proportional stacked bar plot.
fill
means two different things here. In the aesthetics,
fill
refers to the fill color of the bars. In
geom_bar
’s position argument, "fill"
makes
each bar the same height, so you can compare proportions. If the age
category was independent of the job satisfaction, the split between the
age categories would be at the same height in each of the five bars.
There’s some variation here, but we’ll need the hypothesis test to
determine whether it’s a significant difference.
ggplot(stack_overflow, aes(job_sat, fill = age_cat)) +
geom_bar(position = "fill") +
ylab("proportion")

Chi-square independence test using
chisq_test()
The hypothesis test for independence is called a chi-square
independence test. There’s a base-R function for it called
chisq.test()
, but like prop.test()
, it’s
fiddly to use. The easiest option is to use infer’s
chisq_test()
. Pipe from the sample dataset, passing
a formula with the response variable on the left and the explanatory
variable on the right. The p-value is 0.23,
which is above the significance level we set, so
we conclude that age categories are independent of job
satisfaction. The chi-square distribution, whose CDF is used to
calculate the p-value from the test statistic, has a degrees of freedom
argument, just like the t-distribution. The results show that there are
four degrees of freedom. This is the number of response categories minus
one, times the number of explanatory categories minus one. Two minus one
times five minus one is four.
stack_overflow %>%
chisq_test(age_cat ~ job_sat)
Degrees of freedom
\[
(No. \ of \ response \ categories \ - 1) \times (No. \ of \ explanatory
\ categories \ - 1)
\\(2 \ - 1) \times (5 \ - 1) \ = 4
\]
Swapping the variables
If we swap the variables, so age category is the response, and job
satisfaction is the explanatory variable, the visual inspection
technique is the same: see if the splits for each bar are in similar
places.
ggplot(stack_overflow, aes(age_cat, fill = job_sat)) +
geom_bar(position = "fill") +
ylab("proportion")

Chi-square both ways
If we run the chi-square test with the variables swapped, then the
results are identical. We should ask questions like “are
variables X and Y independent?”, not “is variable X independent
from variable Y?”, since the order doesn’t matter.
stack_overflow %>%
chisq_test(age_cat ~ job_sat)
stack_overflow %>%
chisq_test(job_sat ~ age_cat)
What about directions and tails?
We didn’t worry about tails in this test, and in fact
chisq_test()
doesn’t have an alternative
argument. This is because the test statistic is based
on the square of observed and expected counts, and square
numbers are non-negative. That means that chi-square tests tend
to be right-tailed tests.
1 Left-tailed chi-square tests are used in statistical
forensics to detect is a fit is suspiciously good because the data was
fabricated. Chi-square tests of variance can be two-tailed. These are
niche uses though.
args(chisq_test)
function (x, formula, response = NULL, explanatory = NULL, ...)
NULL
Chi-square goodness of fitness test
Last time you used a chi-square test to compare proportions in two
categorical variables. This time, we’ll use another variant of the
chi-square test to compare a single categorical variable to a
hypothesized distribution.
Purple links
The Stack Overflow survey contains a fun question about how the user
feels when they discover that they already visited the top resource when
trying to solve a coding problem. There are four possible answers,
stored in the purple_link variable.
purple_link_counts <- stack_overflow %>%
count(purple_link) %>%
arrange(desc(n))
purple_link_counts
Declaring the hypothesis
Let’s hypothesize that half the users in the population would respond
“Hello old friend”, and the other three responses would get one sixth
each. The tribble
function provides a convenient way of
manually entering values for a data frame, and returns a standard
tibble. We can specify the hypotheses as whether or not the sample
matches this hypothesized distribution. The test statistic measures how
far the distribution of proportions observed in the sample is from the
hypothesized distribution of proportions. I’ll set a significance level
of point-zero-one.
1 tribble is short for “row-wise tibble”; not to be
confused with the alien species from Star Trek
hypothesized <- tribble(~ purple_link, ~ prop,
"Hello, old friend", 1/2,
"Amused", 1/6,
"Indifferent", 1/6,
"Indifferent", 1/6,
"Annoyed", 1/6)
hypothesized
Hypothesized counts by category
To visualize the distribution of the purple links, it will help to
have the hypothesized counts. These are the hypothesized proportions
times the total number of observations in the sample.
n_total <- nrow(stack_overflow)
hypothesized <- tribble(~ purple_link, ~ prop,
"Hello, old friend", 1/2,
"Amused", 1/6,
"Indifferent", 1/6,
"Indifferent", 1/6,
"Annoyed", 1/6) %>%
mutate(n = prop * n_total)
hypothesized
Visualizing counts
The natural way to visualize the categories is with a bar plot. We’re
using geom_col()
here since we already calculated the
counts. The purple points show the hypothesized counts, so you can
compare the sample distribution to the hypothesized distribution. Two of
the bars are close to the values we hypothesized and two are slightly
different. We’ll need to run the hypothesis test to see if the
differences are statistically significant.
ggplot(purple_link_counts, aes(purple_link, n)) +
geom_col() +
geom_point(data = hypothesized, color = "red")

Chi-square goodness of fit test using
chisq_test()
The one sample chi-square test is called a goodness of fit test. To
run it, we need the hypothesized proportions in vector form rather than
in a tibble. Again we use chisq_test()
from
infer
. However, this time the arguments are different.
Piping from the dataset, we set response to the name of the column of
interest, and set \(p\) to the
hypothesized distribution. The degrees of freedom are one less
than the number of choices in the survey. Four minus one is
three. The p-value is very small, much lower than the
significance level we set. Thus we conclude that the
sample distribution of proportions is different than the hypothesized
distribution of proportions.
hypothesized_props <- c("Hello, old friend" = 1/2, Amused = 1/6, Indifferent = 1/6, Annoyed = 1/6)
stack_overflow %>%
chisq_test(response = purple_link, p = hypothesized_props)
Testing sample size
In order to conduct a hypothesis test, and be sure that the result is
fair, a sample must meet three requirements: it is a random sample of
the population; the observations are independent; and there are enough
observations. Of these, only the last condition is easily testable with
code.
The minimum sample size depends on the type of hypothesis tests you
want to perform. Let’s test some scenarios on the
late_shipments
dataset.
Using the late_shipments
dataset, get counts by the
freight_cost_group
columns.
Insert a suitable number to inspect whether the counts are “big
enough” for a two sample t-test.
# Get counts by freight_cost_group
counts <- late_shipments %>%
count(freight_cost_group)
Error in count(., freight_cost_group) : object 'late_shipments' not found
Using the late_shipments
dataset, get counts by the
late
column.
Insert a suitable number to inspect whether the counts are “big
enough” for a one sample proportion test.
Using the late_shipments
dataset, get counts by the
vendor_inco_term
and freight_cost_group
columns.
Insert a suitable number to inspect whether the counts are “big
enough” for a chi-square independence test.
Using the late_shipments
dataset, get counts by the
shipment_mode
column.
Insert a suitable number to inspect whether the counts are “big
enough” for an ANOVA test.
```r
# Count the values of shipment_mode
counts <- late_shipments %>%
count(shipment_mode)
# See the result
counts
```
<div data-pagedtable="false">
<script data-pagedtable-source type="application/json">
{"columns":[{"label":["shipment_mode"],"name":[1],"type":["chr"],"align":["left"]},{"label":["n"],"name":[2],"type":["int"],"align":["right"]}],"data":[{"1":"Air","2":"909"},{"1":"Air Charter","2":"6"},{"1":"Ocean","2":"84"},{"1":"NA","2":"1"}],"options":{"columns":{"min":{},"max":[10],"total":[2]},"rows":{"min":[10],"max":[10],"total":[4]},"pages":{}}}
</script>
</div>
```r
# Inspect whether the counts are big enough
all(counts$n >= 30)
```
```
[1] FALSE
```
The “There is only one test” framework
Traditional hypothesis tests make assumptions about the sample
dataset. Here, we’ll look at what to do when those assumptions don’t
hold.
Imbalance data
Sometimes you can’t get a sample representing all parts of your
population. Consider this subset of the Stack Overflow survey. There are
fewer users with hobbyist “No” than “Yes”, and fewer users with “At
least 30” ages than “Under 30”. In fact, there are no users with
hobbyist “No” and age “At least 30”. This is a problem for a proportion
test, since it requires that every group has at least ten observations.
I used count(…, .drop)
argument to include subgroups with a
zero count. A dataset where some group’s counts are much larger
than others is called “imbalanced”.
Hypotheses
We can declare hypotheses to test. I’ll set a significance level of
point-one.
\(H_{0}\) : The proportion of
hobbyists under 30 is the same as the proportion of hobbyists at least
30.
\(H_{A}\) : The proportion of
hobbists under 30 is different from the proportion of hobbyists at least
30
Proceeding with a proportion test regardless
Let’s ignore the small sample problems and run a proportion test. The
p-value is \(2.403 \times 10^{-5}\) ,
which is below the significance level. This test suggests that there is
enough evidence to reject the null hypothesis. We’ll revisit this
later.
A grammar of graphics
Traditionally, every type of plot has a name, like scatter plot or
line plot. In base-R you decide the plot type by specifying arguments to
the plot function, or by calling a dedicated function like hist. This is
fine until you want to draw a custom plot like a histogram with polar
coordinates. If a function to draw that doesn’t exist, you are stuck.
The beauty of the grammar of graphics system that ggplot2 implements is
that plots are broken down into components that can be combined in many
ways. It’s more flexible creating a new function for every plot type you
can think of. What if there was a grammar of hypothesis tests?
Scatter plot |
plot(, type = "p") |
ggplot( ) + geom_point( ) |
Line plot |
plot(, type = "l") |
ggplot( ) + geom_line( ) |
Histogram |
hist( ) |
ggplot( ) + geom_histogram( ) |
Box plot |
boxplot( ) |
ggplot( ) + geom_boxplot( ) |
Bar plot |
barplot( ) |
ggplot( ) + geom_bar( ) |
Pie plot |
pie( ) |
ggplot( ) + geom_bar( ) + coord_polar( ) |
A grammar of hypothesis tests
The good news is that there is a grammar of hypothesis tests. It was
invented by DataCamp instructor Allen Downey, and is called the
“There is only one test” framework. The framework was
implemented in R in the infer
package that you’ve used for
proportion and chi-square tests. All hypothesis tests can be performed
with this code flow, based on specify, hypothesize, generate, and
calculate. By changing the arguments to those functions you can run
standard tests like t-tests, or devise your own custom tests. We’ll
cover specify and hypothesize now, and the others in the next video.
generate is worth mentioning briefly. It creates simulated data
reflecting the null hypothesis. Using simulation rather than
arithmetic equations to get the test statistic is computationally
expensive, so it has only become accessible with modern
computing power. However, simulation has a benefit that it works well
even when you have small samples or imbalanced data.
1 Allen Downey teaches “Exploratory Data Analysis in
Python”.
Specifying the variables of interest
The Stack Overflow survey contains many variables, but for this test,
we only care about two of them. specify()
works like
dplyr’s select, returning only the response and explanatory columns.
specify()
To call specify, you pipe from the dataset, and pass a formula with
the response on the left and the explanatory variable on the right. In
the one sample case, use NULL for the explanatory value. Just as when
you called prop_test()
, you need to tell specify which
response value is regarded as a success. specify returns a tibble with
attributes noting the response and explanatory variable names.
hypothesize()
hypothesize()
declares the type of null hypothesis.
Proportion tests are a special case of the chi-square independence test,
so we choose “independence”. hypothesize doesn’t calculate anything; it
simply adds another attribute to the dataset.
For 2 sample tests, use "independence"
or
"point"
For 1 sample tests, use "point"
---
title: "Hypothesis testing "
output: html_notebook
author: Juan Fernando Mosquera Araujo
date: 2024-03-15
---

### **Hypothesis test and z-score**

### **A/B testing**

This an experimental design technique, which has roots in hypothesis testing, to test different scenarios and see which has the best performance

## **Stack Overflow Developer Survey 2020**

Each year, Stack Overflow surveys its users, who are primarily software developers, about themselves, how they use Stack Overflow, their work, and the development tools they use. In this analysis we'll look at a subset of the survey responses, from users who identified as Data Scientists.

```{r}
# Load libraries
library(tidyr)
library(dplyr)
library(ggplot2)
library(fst)
library(tibble)
library(readr)
```

```{r include=FALSE}
ruta_stckov <- "C:/Users/JuanFer Mosquera/Documents/datasets/stack_overflow.rds"
```

```{r}
stack_overflow <- read_rds(ruta_stckov)
stack_overflow
colnames(stack_overflow)
```

```{r}
glimpse(stack_overflow)
```

### **Hypothesizing about the mean**

Let's hypothesize that the mean annual compensation of the population of data scientists is \$110000. The first thing we can do to check this is to examine the sample data in the survey. Annual compensation, converted to dollars, is included in the converted_comp column. The sample mean is a type of point estimate, which is another name for a summary statistic. You can calculate it with base-R, or more verbosely with dplyr. Either way, the answer is \$122000. That's different from our hypothesis, but is it meaningfully different?

```{r}
mean_comp_samp <- mean(stack_overflow$converted_comp)
mean_comp_samp
```

```{r}
mean_comp_samp <- stack_overflow %>%
  summarise(mean_compensation = mean(converted_comp)) %>%
  pull(mean_compensation)
mean_comp_samp
```

### **Generating a bootstrap distribution**

To answer this, we need to generate a bootstrap distribution of sample means. This is done by resampling the dataset, calculating the sample mean for that resample, then repeating those steps.

```{r}
# Repeat steps 1 & 2 many times 
so_boot_distn <- replicate(
  n = 5000,
  expr = {
    #Step 1. Resample
    stack_overflow %>%
      slice_sample(prop = 1, replace = TRUE) %>%
      # Step 2. Calculate point estimate
      summarise(mean_compensation = mean(converted_comp)) %>%
      pull(mean_compensation)
  }
)
head(so_boot_distn)
```

### **Visualizing the bootstrap distribution**

Here's a histogram of the bootstrap distribution. Its bell shape means it's roughly normally distributed. Notice that $110000$ is on the left of the distribution.

```{r}
tibble(resample_mean = so_boot_distn) %>%
  ggplot(aes(resample_mean)) + 
  geom_histogram(binwidth = 1000)
```

### **Standard Error**

Recall that the standard deviation of the sample statistics in the bootstrap distribution estimates the standard error of the statistic.

```{r}
std_error <- sd(so_boot_distn)
std_error
```

### **Z-scores**

Since variables have arbitrary units and ranges, before we test our hypothesis, we need to standardize the values. A common way of standardizing values is to **subtract the mean**, and **divide by the standard deviation**. For hypothesis testing, we use a variation where we take the **sample statistic, subtract the hypothesized parameter value, and divide by the standard deviation of the statistic**, **which is the standard error**. The result is called a **z-score**. The sample mean annual compensation for data scientists of \$122000 , minus the hypothesized compensation for data scientists of \$110000, divided by the bootstrap standard error of \$5590 gives a z-score 1.71

```{r}
mean_comp_hyp <- 110000
mean_comp_samp

z_score <- (mean_comp_samp - mean_comp_hyp) / std_error
z_score
```

### **Testing the hypothesis**

Is that a big or small number? Determining that is the goal of this course. In particular, we can now state one of the uses of hypothesis testing. Determining whether a sample statistic is close to or far away from an expected value.

### **Standard normal (z) distribution**

Here's a density plot of the probability density function for the standard normal distribution. That is, a normal distribution with mean zero and standard deviation one. It's often called the z-distribution. As you might guess, z-scores are related to the z-distribution.

```{r}
# Create a dataset of z-distribution and plot it
tibble(x = seq(-4, 4, 0.01)) %>%
  ggplot(aes(x)) + 
  stat_function(fun = dnorm) + 
    ylab("PDF(x)")
```

### **P-values**

Hypothesis tests are like criminal trials.

The true situation is that the defendant committed the crime, or they didn't. There are also two possible outcomes. The judge gives a guilty or a not guilty verdict. The initial assumption is that the defendant is not guilty. It's up to the prosecution team to come up with evidence beyond a reasonable doubt that the defendant committed the crime in order for a guilty verdict to be given.

### **Age of first programming experience**

Let's return to our survey. The `age_first_code_cut` variable classifies when the user began programming. If they were 14 or older, they are classified as **"adult"**; otherwise **"child"**.

Suppose previous research suggests that 35% of software developers programmed as children. This raises a question answerable with our dataset.

**Does our sample provide evidence that data scientists have a greater proportion starting programming as a child**?

### **Definitions**

Hypothesis is a statement about a **population parameter**. We don't know the true value of this population parameter; we can only make inferences about it from the data. **Hypothesis tests compare two competing hypotheses**. These two hypotheses are the **null hypothesis**, representing the existing idea, and the **alternative hypothesis**, representing a new idea.

They are denoted $H_{0}$ and $H_{A}$ respectively. **The null hypothesis is like the current champion, and the alternative hypothesis is like a challenger trying to overthrow that champion**. Here, the null hypothesis is that our data won't tell us anything new, and that the proportion of data scientists starting programming as children follows the previous research on software developers, at 35%. The alternative hypothesis is that our hunch is correct, and that the percentage is greater than 35%.

### **Criminal trials vs. Hypothesis Testing**

Let's compare the criminal trial with the hypothesis test. The defendant committing the crime is equivalent to the **alternative hypothesis being true**, and the defendant not committing the crime is equivalent to the **null hypothesis being true**. Rather than saying we accept the alternative hypothesis, **the verdicts are rejecting the null hypothesis**, or **failing to reject the null hypothesis**. Initially, we assume that the null hypothesis is true, and this only changes if the sample provides enough evidence to reject it.

The hypothesis testing equivalent of "beyond a reasonable doubt" is known as the significance level.

### **One-tailed & two-tailed tests**

The tails of a distribution are the left and right edges of its PDF. **Hypothesis tests determine whether the sample statistics lie in the tails of the distribution**. There are three types of tests, and the phrasing of the alternative hypothesis determines which type you should use. In this case, we need a **right-tailed test**.

### **P-values**

**p-values** measure the strength of support for the null hypothesis. **Small p-values mean our statistic is producing an unlikely result in a tail of our null distribution**. **p-values** are probabilities, so they are always **between zero and one**.

### **Defining p-values**

The definition of a p-value has four parts. It's a probability related to where the statistic from our sample lies on the null distribution. **It measures how many values fall farther away than what we observed in the direction of the alternative hypothesis**. It is based on the null distribution, which assumes that the null hypothesis is true. The **p-value** measures evidence. Does it favor the original assumption or the new alternative?

### **Calculating the z-score**

To calculate the **p-value**, we need to run through the same steps as before. We get the **sample statistic**, in this case **the proportion of data scientists who started programming as children**. We get the hypothesized value from the null hypothesis, $35\%$. We get the standard error from the bootstrap distribution. The **z-score** is the difference between the proportions, divided by the standard error.

```{r}
prop_child_samp <- stack_overflow %>%
  summarise(point_estimate = mean(age_first_code_cut == "child")) %>%
  pull(point_estimate)

prop_child_samp
```

```{r}
sd_boot_distn <- replicate(
  n = 5000,
  expr = {
    #Step 1. Resample
    stack_overflow %>%
      slice_sample(prop = 1, replace = TRUE) %>%
      # Step 2. Calculate point estimate
      summarise(sample_mean = mean(age_first_code_cut == "child")) %>%
      pull(sample_mean)
  }
)
head(sd_boot_distn)
```

```{r}
std_err = sd(sd_boot_distn)
std_err
```

```{r}
prop_child_hyp <- 0.35
#std_err <- 0.0096028
z_score <- (prop_child_samp - prop_child_hyp) / std_err
z_score
```

### **Calculating the p-value**

The last step is new. We pass the z-score to the normal cumulative distribution function, pnorm. Since we want the right-hand tail for this test, we set lower-dot-tail to FALSE. The p-value is 8.04 e -10 That's a very small number, but is it small enough to reject the null hypothesis?

```{r}
p_value <- pnorm(z_score, lower.tail = FALSE)
p_value
```

### **Statistical significance**

**p-values** quantify how much evidence there is for the null hypothesis. **Large p-values lead us to not have enough evidence for the alternative hypothesis**, sticking with the assumed null hypothesis instead. **Small p-values make us doubt this original assumption in favor of the alternative hypothesis**. What defines the cutoff point between a small p-value and a large one?

### **Significance level**

The cutoff point is known as the **significance level**, and is denoted alpha (**α**). The appropriate significance level depends on the dataset and the discipline you are working in. **5%** is the most common choice, but **10%** and **1%** are also popular. The significance level gives us a decision process for which hypothesis to support. **If the p-value is less than alpha**, we **reject the null hypothesis**. **Otherwise we fail to reject it**. It's important that you **decide what the appropriate significance level should be before you run your test**. Otherwise there is a temptation to decide on a significance level that lets you choose the hypothesis you want.

### **Calculating the p-value**

The workflow is now to set the significance level, in this case 0.05. Then calculate the z-score and use the standard normal CDF to get the p-value. In this case, the p-value of **8 e-06** is less than or equal to point-zero-five, so we reject the null hypothesis. We have strong evidence for the alternative hypothesis that the proportion of data scientists starting programming as children is greater than **35%**.

```{r}
alpha <- 0.05

prop_child_samp <- stack_overflow %>%
  summarise(point_estimate = mean(age_first_code_cut == "child")) %>%
  pull(point_estimate)

prop_child_hyp <- 0.35 
std_error <- 0.0096028
z_score <- (prop_child_samp - prop_child_hyp) / std_error
p_value = pnorm(z_score, lower.tail = FALSE)
p_value
```

```{r}
p_value <= alpha
```

#### 

$$
p-value \ is \ less \ than \ or \ equal \ to \ \alpha, so \ reject \ H_{0} \ and \ accept \ H_{A}
$$

### **Confidence intervals**

To get a sense as to potential values of the population parameter, it's common to choose a confidence interval of one minus the significance level. For a significance level of **0.05**, we'd use a 95% confidence interval. Here's the calculation using the quantile method. The interval runs from **0.37** to **0.41** giving a range of plausible values for the proportion of data scientists starting programming as children.

```{r}
first_code_boot_samp <- replicate(
  n = 5000,
  expr = {
    stack_overflow %>%
      slice_sample(prop = 0.25, replace = TRUE) %>%
      summarise(point_estimate = mean(age_first_code_cut == "child")) %>%
      pull(point_estimate)
  }
)

first_code_boot_distn <- tibble(first_code_child_rate = first_code_boot_samp)
head(first_code_boot_distn)
```

```{r}
conf_int <- first_code_boot_distn %>%
  summarise(lower = quantile(first_code_child_rate, 0.025),
            upper = quantile(first_code_child_rate, 0.975))

conf_int
```

### **Types of errors**

Returning to the criminal trial analogy, there are two possible truth states and two possible outcomes, for four combinations in total. Two of these indicate that the verdict was correct. If a defendant didn't commit the crime, but the verdict was guilty, they are wrongfully convicted. If the defendant committed the crime but the verdict was not guilty, they got away with it. These are both errors in justice. Similarly, for hypothesis testing, there are two ways to get it right, and two types of error. If you support the alternative hypothesis when the null hypothesis was correct, you made a false positive error. If you support the null hypothesis when the alternative hypothesis was correct, you made a false negative error. These errors are sometimes known as type one and type two errors.

|                | Actual H0      | Actual HA      |
|----------------|----------------|----------------|
| **Choosen H0** | correct        | false negative |
| **Choosen HA** | false positive | correct        |

### **Possible errors in our example**

#### 

$$
if \ p \leqslant \alpha, we \ reject \ \ H_{0} 
$$

A false positive (Type I) error could have ocurred: We thought that data scientists starting coding as children at a higher rate when in reality they did not

#### 

$$
if \ p > \alpha, \ we \ fail \ to \ reject \ H_{0}
$$

A false negative (Type II) error could have ocurred: we thought that data scientists coded as children at the same rate as software engineers when in reality they coded as children at a higher rate.

### **Performing t-tests**

Here, we'll look at the related problem of **comparing sample statistics across groups of a variable**. In the Stack Overflow dataset, `converted_comp` is a numerical variable of annual compensation. `age_first_code_cut` is a categorical variable with two levels: **child and adult**, which describe when the user started programming. We can ask questions about differences in compensation across the two age groups. That is, **do users who first programmed as a child tend to be compensated higher than those that started as adults**?

### **Hypotheses**

The null hypothesis is that **the sample mean for the two groups is the same**, and the alternative hypothesis is that **the sample statistic for users who started coding as children is greater than for users who started coding as adults**. We can write these hypotheses using equations. Mu (**µ**) represents an unknown population mean, and we use subscripts to denote which group the sample mean belongs to. An alternate way of writing the equations is to compare the differences in sample means to zero.

### **Calculating group wise summary statistics**

Calculating the summary statistics for each group is straightforward. You start with the sample, group by the categorical variable, then summarize the numeric variable. A `dplyr` way of doing this is shown, but there are many possibilities. Pause the video for a moment and think about different ways of solving this. Here, the child programmers have a mean compensation of \$132419.6 dollars compared to \$111313.3 for adult programmers. Is that increase statistically significant or could it be explained by sampling variability?

```{r}
stack_overflow %>%
  group_by(age_first_code_cut) %>%
  summarise(mean_compensation = mean(converted_comp))
```

### **Test statistics**

Although we don't know the population mean, **we estimate it using the sample mean**. **x-bar** is used to denote a sample mean. Then we use subscripts to denote which group a sample mean corresponds to. **The difference between these two sample means is called the test statistic for the hypothesis test**. The z-scores are another type of test statistic.

### **Standardizing the test statistic**

z-scores are calculated by taking the sample statistic, subtracting the mean of this statistic as the population parameter of interest, then dividing by the standard error. In the two sample case, the test statistic, denoted t, uses a similar equation. You take the difference between the sample statistics for the two groups, subtract the hypothesized difference between the two groups, then divide by the standard error. $$
x = 
\frac{sample \ stat + population \ parameter}{standard \ error}
$$

$$
t = \frac{difference \ in \ sample \ stats \ - difference \ in \ population \ parameters}{standard \ error}
$$

$$
t = \frac{(\bar{x}_{child} \ - \bar{x}_{adult}) \ - (\mu_{child} \ - \mu_{adult})}{SE(\bar{x}_{child} \ - \bar{x}_{adult})}
$$

### **Standard Error**

To calculate the standard error, needed for the denominator of the test statistic equation, bootstrapping is most accurate. However, there is an easier way to approximate it. You calculate the standard deviation of the numeric variable for each group in the sample, and the number of observations for each group. Then it's just some arithmetic.

### **Assuming the null hypothesis is true**

Here's the test statistic equation again. If we assume that the **null hypothesis is true**, there's a simplification we can make. **The null hypothesis assumes that the population means are equal**. That makes one term in the numerator disappear. Using the approximation for the standard error, we now have a way of calculating the test statistic using only calculations on the sample dataset. We need the mean, standard deviation, and number of observations for each group. This is another group-by and summarize combination. $$
SE(\bar{x}_{child} \ - \bar{x}_{adult}) \approx \sqrt{\frac{s^2_{child}}{n_{child}} \ + \frac{s^2_{adult}}{n_{adult}}}
$$ $$
H_{0}: \mu_{child} \ - \mu_{adult} = 0
$$

$$
t = \frac{(\bar{x}_{child} \ - \bar{x}_{adult})}{SE(\bar{x}_{child} \ - \bar{x}_{adult})}
$$

$$
t = \frac{(\bar{x}_{child} \ - \bar{x}_{adult})}{\sqrt{\frac{s^2_{child}}{n_{child}}} \ + \frac{s^2_{adult}}{n_{adult}}}
$$

```{r}
group_tests <- stack_overflow %>%
  group_by(age_first_code_cut) %>%
  summarise(x_bar = mean(converted_comp), s = sd(converted_comp), n = n())

# See the results
group_tests
```

### **Calculating the test statistic**

Annoyingly, it requires some grim data manipulation code to calculate the test statistic from this data frame. Since this isn't a course on data manipulation, let's do the calculation with separate variables. The numerator is a simple subtraction, and the denominator is like a weighted hypotenuse. The t-statistic is **1.86**. Just as with z-scores, we can't do anything with that number yet

#### 

$$
t = \frac{(\bar{x}_{child} \ - \bar{x}_{adult})}{\sqrt{\frac{s^2_{child}}{n_{child}} \ + \frac{s^2_{adult}}{n_{adult}}}}
$$

```{r}
numerator <- group_tests$x_bar[2] - group_tests$x_bar[1]
denominator <- sqrt((group_tests$s[2]^2 / group_tests$n[2]) + (group_tests$s[1]^2 / group_tests$n[1]))
t_stat <- numerator / denominator
t_stat
```

### **Hypothesis testing workflow**

1.  Identify population parameter that is hypothesized about

2.  Specify the null & alternative hypothesis

3.  Determine (standardized) test statistic & corresponding null distribution

4.  Conduct hypothesis test in R

5.  Measure evidence against the null hypothesis

6.  Make a decision comparing evidence to significance level

7.  Interpret the results in the context of the original problem

### **Calculating p-values from t-statistics**

The test statistic, t, follows a t-distribution. t-distributions have a parameter called the **degrees of freedom**, or **df** for short. Here's a line plot of the PDF of a t-distribution with one degree of freedom in yellow, and the PDF of a normal distribution in blue dashes. Notice that the t-distribution for small degrees of freedom has fatter tails than the normal distribution, but otherwise they look similar.

### **Degrees of freedom**

As you increase the degrees of freedom, the t-distribution gets closer to the normal distribution. In fact, **a normal distribution is a t-distribution with infinite degrees of freedom**. Degrees of freedom are defined as **the maximum number of logically independent values in the data sample**. That's a fairly tricky concept, so let's try an example.

### **Calculating degrees of freedom**

Suppose your dataset has 5 independent observations. Four of the values are 2, 6, 8, and 5. You also know the sample mean is 5. The last value is no longer independent; it must be 4. Even though all five observations in the sample were independent, because you know an additional fact about it, you only have 4 degrees of freedom. In our two sample case, there are as many degrees of freedom as observations, minus two because we know two sample statistics.

### **Hypotheses**

Recall the hypotheses for our Stack Overflow question about compensation for the two age groups. Since this is a "greater than" alternative hypothesis, we need a right-tailed test.

$H_{0}:$ *The mean compensation (in USD) is the same for those that coded first as a child and those that coded first as an adult.*

$H_{A}:$ *The mean compensation (in USD) is **greater** for those that coded first as a child compared to those that coded first as an adult.*

Use a **right-tailed test**

### **Significance level**

We're going to calculate a p-value in a moment. We need to decide on a significance level before we do that. There are several possibilities; I'm going to use 0.1

That means that if the p-value is less than point-one, we reject the null hypothesis in favor of the alternative.

$$\alpha = 0.1$$

$$
if \ p \leqslant \alpha \ then \ reject \ H_{0}
$$

### **Calculating p-values: One proportion vs. a value**

**`p_value <- pnorm(z_score, loer.tail = FALSE)`**

### **Calculating p-values: Two means from different groups**

$$
t = \frac{(\bar{x}_{child} \ - \bar{x}_{adult})}{\sqrt{\frac{s^2_{child}}{n_{child}} \ + \frac{s^2_{adult}}{n_{adult}}}}
$$

Now we are calculating means rather than proportions, the z-score is replaced with a t-test statistic. The calculation also needs the degrees of freedom, which is the number of observations in both groups, minus two.

In the previous steps, we used an approximation using sample information (not bootstrapping) for the test statistic standard error. A consequence of this is that to calculate the p-value, we need to transform the test statistic using the t-distribution CDF instead of the normal distribution CDF. Using this approximation adds more uncertainty and that's why this is a t instead of a z problem.

The t distribution allows for more uncertainty when using multiple estimates in a single statistic calculation. Notice the use of `pt()` instead of `pnorm()`, and that the df argument is set to the degrees of freedom. This p-value is less than the significance level of **0.1**, we should reject the null hypothesis in favor of the alternative hypothesis that Stack Overflow data scientists who started coding as a child earn more.

```{r}
numerator <- group_tests$x_bar[2] - group_tests$x_bar[1]
denominator <- sqrt((group_tests$s[2]^2 / group_tests$n[2]) + (group_tests$s[1]^2 / group_tests$n[1]))
t_stat <- numerator / denominator
t_stat
```

```{r}
degrees_of_freedom <- group_tests$n[2] + group_tests$n[1] - 2
degrees_of_freedom
```

```{r}
p_value <- pt(t_stat, df = degrees_of_freedom, lower.tail = FALSE)
p_value
```

### **Why is t needed?**

The process for calculating p-values is to start with the sample statistic, standardize it to get a test statistic, then transform it via a cumulative distribution function. Previously, that final transformation was denoted `z`, and the CDF transformation used the (standard normal) z-distribution. The test statistic was denoted `t`, and the transformation used the t-distribution.

In which hypothesis testing scenario is a t-distribution needed instead of the z-distribution?

**Using a sample standard deviation to estimate the standard error is computationally easier than using bootstrapping. However, to correct for the approximation, you need to use a t-distribution when transforming the test statistic to get the p-value.**

## **ANOVA tests**

We´ve seen how to compare two groups in the unpaired and paired cases. What if there are more than two groups?

### **Job satisfaction: 5 categories**

The Stack Overflow survey includes a job satisfaction variable, with five categories from "Very dissatisfied" to "Very satisfied".

```{r}
stack_overflow %>%
  count(job_sat)
```

### **Visualizing multiple distributions**

Suppose we want to know if mean annual compensation is different for each of the levels of job satisfaction. The first thing to do is visualize the distributions with box plots. I've used coord_flip to swap the x and y axes, making the category labels easier to read. "Very satisfied" looks slightly higher than the others, but to see if they are significantly different, we'll need to use hypothesis tests.

#### Question: Is mean annual compensation different for different levels of job satisfaction?

```{r}
stack_overflow %>% 
  ggplot(aes(x = job_sat, y = converted_comp)) + 
  geom_boxplot() + 
  coord_flip()
```

### **Analysis of variance (ANOVA)**

**ANOVA tests determine whether there are differences between the groups**. First, you fit a linear regression. You call `lm()`, **specifying the numeric variable as the response on the left-hand-side of the formula, and the categories as the explanatory variable on the right-hand side**. Then you call `anova()` to perform an analysis of variance test.

In the `job_sat` row, the right-hand column contains a p-value, which is **0.0013**. The two stars next to it tell you that **this p-value is significant at the point-zero-one level**. That means that at least two of the categories of job satisfaction have significant differences between their compensation levels. The problem is that this method doesn't tell you which two categories they are. For this reason, ANOVA is less useful than pairwise t-tests.

```{r}
mdl_comp_vs_sat <- lm(converted_comp ~ job_sat, data = stack_overflow)
mdl_comp_vs_sat
```

```{r}
anova(mdl_comp_vs_sat)
```

### **Pairwise tests**

To compare all five categories of job satisfaction using hypothesis tests, we can test each pair. There are ten ways of choosing two items from a set of five, so we have ten tests to perform. We'll set the significance level to point-two.

-   $\mu_{very \ dissatisfied} \neq \mu_{slightly \ dissatisfied}$

-   $\mu_{very \ dissatisfied} \neq \mu_{Neither}$

-   $\mu_{very \ dissatisfied} \neq \mu_{Slightly \ satisfied}$

-   $\mu_{very \ dissatisfied} \neq \mu_{Very \ satisfied}$

-   $\mu_{Slightly \ dissatisfied} \neq \mu_{Neither}$

-   $\mu_{Slightly \ dissatisfied} \neq \mu_{Slightly \ satisfied}$

-   $\mu_{Slightly \ dissatisfied} \neq \mu_{Very \ satisfied}$

-   $\mu_{Neither} \neq \mu_{Slightly \ satisfied}$

-   $\mu_{Neither} \neq \mu_{Very \ satisfied}$

-   $\mu_{Slightly \ satisfied} \neq \mu_{Very \ satisfied}$

### **pairwise.t.test()**

To run all these hypothesis tests in one go, you can use `pairwise.t-test()`. The first argument is the numeric variable whose sample means you are interested in. The second argument is the categorical variable defining the groups. We'll discuss `p.adjust.method` shortly. The result shows a matrix of ten p-values. Three of these are less than our significance level of point-two.

```{r}
anova_results <- pairwise.t.test(stack_overflow$converted_comp, stack_overflow$job_sat, p.adjust.method = "none")
anova_results
#p_values <- as.vector(unlist(na.omit(anova_results$p.value)))
p_values <- as.vector(unlist(anova_results$p.value))
p_values <- na.omit(p_values)
#p_values <- anova_results$p.value[complete.cases(anova_results$p.value)]
```

```{r}
p_values
```

### **As the no. of groups increases...**

In this case we have five groups, resulting in ten pairs. **As the number of groups increases, the number of pairs - and hence the number of hypothesis tests - increases quadratically**. The more tests you run, the higher the chance that at least one of them will give a false positive significant result. With a significance level of **0.2**, if you run one test, the chance of a false positive result is **0.2**. With five groups and ten tests, the probability of at least one false positive is around 0.7. With twenty groups, it's almost guaranteed that you'll get at least one false positive.

### **Bonferroni correction**

The solution to this is to apply an adjustment to increase the p-values, reducing the chance of getting a false positive. One common adjustment is the Bonferroni correction. Now only two of the pairs appear to have significant differences.

```{r}
pairwise.t.test(stack_overflow$converted_comp, stack_overflow$job_sat, p.adjust.method = "bonferroni")
```

### **More methods**

R provides several methods for adjusting the p-values. You can list their names with `p.adjust.methods =`.

**Holm** adjustment is the default. It's less strict than **Bonferroni**, but works well in most situations.

```{r}
p.adjust.methods
```

### **Bonferroni and Hold adjustments**

If you have ten tests, Bonferroni just multiplies the p-values by ten up to a maximum of one. Holm will multiply the smallest by ten, then the second smallest by nine, and so on. There's a possible extra correction to make sure that the order of p-values from smallest to largest is preserved, but that's the gist of it.

```{r}
p_values
```

#### **Bonferroni**

```{r}
pmin(1, 10 * p_values)
```

#### **Holm**

```{r}
pmin(1, 1:10 * sort(p_values))
```

## **One-sample proportion tests**

The previous example tested a single proportion against a specific value. As with means, we can also test differences between proportions in two populations.

### **Recap**

The previous hypothesis tests **measured whether or not an unknown population proportion was equal to some value**. We used a bootstrap distribution of the sample to e**stimate the standard error of the sample statistic**. The standard error was used to calculate a standardized test statistic, which was used to get a p-value, which was used to decide whether or not to reject the null hypothesis. Bootstrap distributions can be computationally intensive to calculate, so this time we'll calculate the test statistic without it.

### **Standardized test statistic for proportions**

An unknown population parameter that is a proportion, **or population proportion for short,** is denoted `p`. The sample proportion is denoted $\hat{p}$, and the hypothesized value for the population proportion is denoted $p_{0}$ . As seen previously, the standardized test statistic is a z-score. **You calculate it by starting with the sample statistic, subtracting its mean, then dividing by its standard error**.

$\hat{p}$ minus the mean of $\hat{p}$ , divided by the standard error of $\hat{p}$ .

Recall from Sampling in R, the mean of $\hat{p}$ is $p$. Under the null hypothesis, the unknown proportion p is assumed to be the hypothesized population parameter $p_{0}$. The z-score is now $\hat{p}$ minus $p_{0}$, divided by the standard error of $\hat{p}$.

$p$ : Sample proportion

$\hat{p}$ : sample proportion (sample statistic)

$p_{0}$ : hypothesized population proportion

$$ z = \frac{\hat{p} \ - mean(\hat{p})}{standard \ error(\hat{p})} = \frac{\hat{p} \ - p}{standard \ error(\hat{p})} $$

Assuming $H_{0}$ is true. $p = p_{0}$. so

$$ z = \frac{\hat{p} \ - p_{0}}{standard \ error(\hat{p})} $$

### **Easier standard error calculations**

Here's the approximate standard error for two sample means. For proportions, under $H_{0}$, the standard error of $\hat{p}$ is $p_{0}$ times $1 - p_{0}$ , divided by the number of observations, then square-rooted. We can substitute this into our equation for the z-score. This is easy to calculate because it only uses $\hat{p}$ and $n$, which we get from the sample, and $p_{0}$, which we chose.

$$ SE(\bar{x}_{child} \ - \bar{x}_{adult}) \approx \sqrt{\frac{s^2_{child}}{n_{child}} \ + \frac{s^2_{adult}}{n_{adult}}} $$

$$ SE_{\hat{p}} \ = \sqrt{\frac{p_{0} \ * (1 - p_{0})}{n}} $$

Assuming $H_{0}$ is TRUE,

$$ z \ = \frac{\hat{p} \ - p_{0}}{\sqrt{\frac{p_{0} \ * \ (1 \ - p_{0})}{n}}} $$

This only uses sample information ($\hat{p}$ and $n$) and the hypothesized parameter ($p_{0}$)

### **Why z instead of t?**

Why we used a z-distribution here, but a t-distribution previously?. **This is the test statistic equation for the two sample mean case**. The standard deviation of the sample, **s**, is calculated from the sample mean, $\bar{x}$. That means that $\bar{x}$ is used in the numerator to estimate the population mean, and in the denominator to estimate the population standard deviation. This dual usage causes an increase in our uncertainty about the estimate of the population parameter. Since t-distributions are effectively a normal distribution with fatter tails, we can use them to account for this extra uncertainty. In effect, the t-distribution provides extra caution against mistakenly rejecting the null hypothesis. For proportions, we only use $\hat{p}$ in the numerator, thus avoiding the problem with uncertainty, and a z-distribution is fine.

$$ t \ = \frac{(\bar{x}_{child} \ - \bar{x}_{adult})}{\sqrt{\frac{a^2_{child}}{n_{child}} \ + \frac{s^2_{adult}}{n_{adult}}}} $$

-   $s$ is calculated from $\bar{x}$, so $\bar{x}$ is used to estimate the population mean and to estimate the population standard deviation

-   This increases the uncertainty in our estimate of the population parameter.

-   t-distribution has fatter tails than a normal distribution

-   This gives an extra level of caution

-   $\hat{p}$ only appears in the numerator, so z-scores are fine

### **Stack overflow age categories**

Let's hypothesize that half the users are under thirty. We'll set a significance level of 0.01. Just over half the users are under thirty.

$H_{0}$ : The proportion of SO users under thirty is equal to 0.5

$H_{A}$ : The proportion of SO users under thirty is not equal to 0.5

```{r}
alpha <- 0.01

stack_overflow %>%
  count(age_cat)
```

### **Variables for z**

Let's get the numbers needed for the z-score. $\hat{p}$ is the proportion of rows in the sample where age_cat equals under thirty. $p_{0}$ is 0.5 according to the null hypothesis. n is the number of rows in the dataset.

```{r}
p_hat <- stack_overflow %>%
  summarise(prop_under_30 = mean(age_cat == "Under 30")) %>%
  pull(prop_under_30)

p_hat
```

```{r}
p_0 <- 0.5
n <- nrow(stack_overflow)
n
```

### **Calculating the z-score**

Calculating the z-score is just arithmetic; the value is three-point-five.

$$ z \ = \frac{\hat{p} \ - p_{0}}{\sqrt{\frac{p_{0} \ * \ (1 \ - p_{0})}{n}}} $$

```{r}
numerator <- p_hat - p_0
denominator <- sqrt((p_0 * (1 - p_0)) / n)
z_score <- numerator / denominator
z_score
```

### **Calculating the p-value**

For left-tailed alternative hypotheses, you transform the z-score into a p-value using the pnorm with the default of lower-dot-tail equals TRUE. For right-tailed alternative hypotheses you set lower-dot-tail to FALSE. For two-tailed alternative hypotheses, you check whether the test statistic lies in either tail, so the p-value is the sum of these two values. Since the normal distribution PDF is symmetric, here this simplifies to twice the left-tailed p-value. Here, the p-value is less than the significance level of point-zero-one, so we reject the null hypothesis and conclude that the proportion of users under thirty is not equal to point-five.

```{r}
# Left tailed ("less than")
p_value_less <- pnorm(z_score)
p_value_less
```

```{r}
# Right tailed ("greater than")
p_value <- pnorm(z_score, lower.tail = FALSE)
p_value
```

```{r}
# Two-tailed ("not equal")
p_value_2t <- pnorm(z_score) + pnorm(z_score, lower.tail = FALSE)
p_value_2t
```

```{r}
p_value_a <- 2 * pnorm(z_score)
p_value_a
```

```{r}
p_value <= alpha
```

## Two sample proportion tests

The previous example **tested a single proportion against a specific value**. As with means, we can also test differences between proportions in two populations.

### **Comparing two proportions**

The Stack Overflow survey contains a hobbyist variable. The value "Yes" means the user described themself as a hobbyist; "No" means they described themself as a professional. **We can hypothesize that the proportion of users who are hobbyists is the same for the under thirty age category as the at least thirty category**.

More formally, **the null hypothesis is that the difference between the population parameters for each group is zero**. I've set a significance level of 0.05.

$H_{0}$ : The proportion of SO users who are hobbyists is the same for those under thirty as those at least thirty

$$
H_{0}: p_{\geq 30} \ - p_{< 30} \ = 0
$$

$H_{A}$ : The proportion of SO users who are hobbyists is different for those under thirty as those at least thirty

$$
H_{A}: p_{\geq 30} \ - p_{<30} \neq 0
$$

### **Calculating the z-score**

Let's break down this z-score equation. The sample statistic is the difference in the proportions for each category $(\hat{p}_{\geq30} \ - \hat{p}_{<30}) \ - 0$. That's the two $\hat{p}$ values on the numerator. We subtract the hypothesized value of the population parameter. Assuming the null hypothesis is true, **that's just zero**, so the term disappears. The denominator is the standard error of the sample statistic $SE(\hat{p}_{\geq30} \ - \hat{p}_{<30})$.

Again we can avoid having to generate a bootstrap distribution. The equation for the standard error is a slightly more complicated version of the equation for the one sample case. In this equation, note that p-hat is the sample proportion for the whole dataset, not for each category. This whole dataset $\hat{p}$ is known as a **pooled estimate of the population proportion**. We need one more equation to get $\hat{p}$. It's a weighted mean of the p-hats for each category. This looks horrendous, but it's just arithmetic, and R is pretty good at that. The good news is that we only need four numbers from the sample dataset to do these calculations.

$$
z = \frac{(\hat{p}_{\geq30} \ - \hat{p}_{<30}) \ - 0}{SE(\hat{p}_{\geq30} \ - \hat{p}_{<30})}
$$

$$
SE(\hat{p}_{\geq0} \ - \hat{p}_{<0}) \ = \sqrt{\frac{\hat{p} \times \ (1 \ - \hat{p})}{n_{\geq30}} \ + \frac{\hat{p} \times \ (1 \ - \hat{p})}{n_{<30}}} 
$$

$\hat{p}$ is a pooled estimate for $p$ (common unknown proportion of successes)

$$
\hat{p} \ = \frac{n_{\geq30} \times \hat{p}_{\geq30} \ + n_{<30} \times \hat{p}_{<30}}{n_{\geq30} \ + n_{<30}}
$$

We only need to calculate 4 numbers: $\hat{p}_{\geq30}, \ \hat{p}_{<30}, \ n_{\geq30}, \ n_{<30}$

### **Getting the numbers for the z-score**

To calculate these four numbers you `group_by()` the categories, and summarize to calculate the sample proportion and row counts. After that, you can do the arithmetic to get the test statistic - the z-score is minus four-point-two - and call pnorm to get the p-value. I'm not going to run through these steps because it's exactly the same as you've seen already. Instead, I'll show you an easier way.

```{r}
stack_overflow %>%
  group_by(age_cat) %>%
  summarise(p_hat = mean(hobbyist == "Yes"), n = n())
```

### **Proportion tests using `prop_test()`**

Naturally, **R has at least two functions for performing this type of hypothesis test**, known as a **proportion test**. Base-R has a function named `prop.test()`, though unfortunately, its interface is somewhat peculiar. **To make your life easier it's best to use `prop_test()` from infer**. This takes a formula, **with your numeric variable on the left and the categories on the right**. The order argument specifies which $\hat{p}$ should be subtracted from the other. Here, it is specified so that you start with the "at least thirty" $\hat{p}_{\geq30}$ , then subtract the "under thirty" $\hat{p}_{<30}$

The `success` argument specifies **which of the two response levels to get the proportion of**. `alternative` takes the same values as t-test. Here, **we have a two-sided alternative hypothesis**. The `correct` argument **specifies whether or not to apply Yates' continuity correction**. This is a fudge factor **needed for technical reasons when the sample size is very small**. Since **each group has over one thousand observations, we don't need it**. The p-value is in the third column of the resulting tibble. It's smaller than the significance level we specified earlier, so **we can conclude that there is a difference in the proportion of Stack Overflow hobbyists between the under and over thirty groups**.

```{r}
library(infer)
stack_overflow %>%
  prop_test(hobbyist ~ age_cat, order = c("At least 30", "Under 30"), success = "Yes", 
            alternative = "two-sided", correct = FALSE)
```

### **Chi-square test of independence**

Just as ANOVA extends t-tests to more than two groups, chi-square tests of independence extend proportion tests to more than two groups.

### **Revisiting the proportion test**

Here's the proportion test from last time. In the first column, the test statistic is the square of the z-score. Minus four-point-two-two squared is seventeen-point-eight. In the second column, chisq_df means the degrees of freedom of a chi-square test. Its value is one.

```{r}
library(infer)
stack_overflow %>%
  prop_test(hobbyist ~ age_cat, order = c("At least 30", "Under 30"), alternative = "two-sided", correct = FALSE)
```

### **Independence of variables**

That proportion test had a positive result. There was evidence that the hobbyist and age category variables had an association. If that wasn't the case, and the proportion of hobbyists was the same for each age category, the variables would be considered statistically independent. More formally, **statistical independence of two categorical variables is when the proportion of successes in the response variable is the same across all categories of the explanatory variable**.

### **Job satisfaction and age category**

Recall that the Stack Overflow sample has an age category variable with two categories and a job satisfaction variable with five categories.

```{r}
stack_overflow %>%
  count(age_cat)
```

```{r}
stack_overflow %>%
  count(job_sat)
```

### **Declaring the hypothesis**

We can declare hypotheses to test for independence of these variables. Here, **age category is the response variable**, and **job satisfaction is the explanatory variable**. **The null hypothesis is that independence occurs.** I've set a **significance level of 0.1**. The test statistic is denoted chi-square. **It quantifies how far away the observed results are from the values you'd expect if independence was true**.

$H_{0}$ : Age categories are independent of job satisfaction levels

$H_{A}$ : Age categories are not independent of job satisfaction levels

Test statistic denoted $\chi^2$

Assuming the independence, how far away are the observed results from the expected values?

```{r}
alpha = 0.1
```

### **Exploratory visualization: Proportional stacked bar plot**

Let's explore the data using a proportional stacked bar plot. `fill` means two different things here. In the aesthetics, `fill` refers to the fill color of the bars. In `geom_bar`'s position argument, `"fill"` makes each bar the same height, so you can compare proportions. If the age category was independent of the job satisfaction, the split between the age categories would be at the same height in each of the five bars. There's some variation here, but we'll need the hypothesis test to determine whether it's a significant difference.

```{r}
ggplot(stack_overflow, aes(job_sat, fill = age_cat)) + 
  geom_bar(position = "fill") +
  ylab("proportion")
```

### **Chi-square independence test using `chisq_test()`**

The hypothesis test for independence is called a **chi-square independence test.** There's a base-R function for it called `chisq.test()`, but like `prop.test()`, it's fiddly to use. The easiest option is to use infer's `chisq_test()`. Pipe from the sample dataset, **passing a formula with the response variable on the left and the explanatory variable on the right**. The p-value is **0.23**, which is **above the significance level** we set, so **we conclude that age categories are independent of job satisfaction**. The chi-square distribution, whose CDF is used to calculate the p-value from the test statistic, has a degrees of freedom argument, just like the t-distribution. The results show that there are four degrees of freedom. This is the number of response categories minus one, times the number of explanatory categories minus one. Two minus one times five minus one is four.

```{r}
stack_overflow %>%
  chisq_test(age_cat ~ job_sat)
```

#### **Degrees of freedom**

$$
(No. \ of \ response \ categories \ - 1) \times (No. \ of \ explanatory \ categories \ - 1)
\\(2 \ - 1) \times (5 \ - 1) \ = 4
$$

### **Swapping the variables**

If we swap the variables, so age category is the response, and job satisfaction is the explanatory variable, the visual inspection technique is the same: see if the splits for each bar are in similar places.

```{r}
ggplot(stack_overflow, aes(age_cat, fill = job_sat)) + 
  geom_bar(position = "fill") + 
  ylab("proportion")
```

### **Chi-square both ways**

If we run the chi-square test with the variables swapped, then the results are identical. **We should ask questions like "are variables X and Y independent?"**, not "is variable X independent from variable Y?", since the order doesn't matter.

```{r}
stack_overflow %>%
  chisq_test(age_cat ~ job_sat)
```

```{r}
stack_overflow %>%
  chisq_test(job_sat ~ age_cat)
```

### **What about directions and tails?**

We didn't worry about tails in this test, and in fact `chisq_test()` **doesn't have an alternative argument**. This is **because the test statistic is based on the square of observed and expected counts**, and square numbers are non-negative. **That means that chi-square tests tend to be right-tailed tests**.

^1^ Left-tailed chi-square tests are used in statistical forensics to detect is a fit is suspiciously good because the data was fabricated. Chi-square tests of variance can be two-tailed. These are niche uses though.

```{r}
args(chisq_test)
```

### **Chi-square goodness of fitness test**

Last time you used a chi-square test to compare proportions in two categorical variables. This time, we'll use another variant of the chi-square test to compare a single categorical variable to a hypothesized distribution.

### **Purple links**

The Stack Overflow survey contains a fun question about how the user feels when they discover that they already visited the top resource when trying to solve a coding problem. There are four possible answers, stored in the purple_link variable.

```{r}
purple_link_counts <- stack_overflow %>%
  count(purple_link) %>%
  arrange(desc(n))

purple_link_counts
```

### **Declaring the hypothesis**

Let's hypothesize that half the users in the population would respond "Hello old friend", and the other three responses would get one sixth each. The `tribble` function provides a convenient way of manually entering values for a data frame, and returns a standard tibble. We can specify the hypotheses as whether or not the sample matches this hypothesized distribution. The test statistic measures how far the distribution of proportions observed in the sample is from the hypothesized distribution of proportions. I'll set a significance level of point-zero-one.

^1^ tribble is short for "row-wise tibble"; not to be confused with the alien species from Star Trek

```{r}
hypothesized <- tribble(~ purple_link, ~ prop,
                        "Hello, old friend", 1/2,
                        "Amused", 1/6,
                        "Indifferent", 1/6,
                        "Indifferent", 1/6,
                        "Annoyed", 1/6)
hypothesized
```

### **Hypothesized counts by category**

To visualize the distribution of the purple links, it will help to have the hypothesized counts. These are the hypothesized proportions times the total number of observations in the sample.

```{r}
n_total <- nrow(stack_overflow)
hypothesized <- tribble(~ purple_link, ~ prop,
                        "Hello, old friend", 1/2,
                        "Amused", 1/6,
                        "Indifferent", 1/6,
                        "Indifferent", 1/6,
                        "Annoyed", 1/6) %>%
  mutate(n = prop * n_total)

hypothesized
```

### **Visualizing counts**

The natural way to visualize the categories is with a bar plot. We're using `geom_col()` here since we already calculated the counts. The purple points show the hypothesized counts, so you can compare the sample distribution to the hypothesized distribution. Two of the bars are close to the values we hypothesized and two are slightly different. We'll need to run the hypothesis test to see if the differences are statistically significant.

```{r}
ggplot(purple_link_counts, aes(purple_link, n)) +
  geom_col() + 
  geom_point(data = hypothesized, color = "red")
```

### **Chi-square goodness of fit test using `chisq_test()`**

The one sample chi-square test is called a goodness of fit test. To run it, we need the hypothesized proportions in vector form rather than in a tibble. Again we use `chisq_test()` from `infer`. However, this time the arguments are different. Piping from the dataset, we set response to the name of the column of interest, and set $p$ to the hypothesized distribution. **The degrees of freedom are one less than the number of choices in the survey**. Four minus one is three. The **p-value is very small, much lower than the significance level we set**. Thus **we conclude that the sample distribution of proportions is different than the hypothesized distribution of proportions**.

```{r}
hypothesized_props <- c("Hello, old friend" = 1/2, Amused = 1/6, Indifferent = 1/6, Annoyed = 1/6)
```

```{r}
stack_overflow %>%
  chisq_test(response = purple_link, p = hypothesized_props)
```

### Testing sample size

In order to conduct a hypothesis test, and be sure that the result is fair, a sample must meet three requirements: it is a random sample of the population; the observations are independent; and there are enough observations. Of these, only the last condition is easily testable with code.

The minimum sample size depends on the type of hypothesis tests you want to perform. Let's test some scenarios on the `late_shipments` dataset.

-   Using the `late_shipments` dataset, get counts by the `freight_cost_group` columns.

-   Insert a suitable number to inspect whether the counts are "big enough" for a two sample t-test.

```{r}
# Get counts by freight_cost_group
counts <- late_shipments %>%
  count(freight_cost_group)

# See the result 
counts

# Inspect whether the counts are big enough
print(all(counts$n >= 30))
```

-   Using the `late_shipments` dataset, get counts by the `late` column.

-   Insert a suitable number to inspect whether the counts are "big enough" for a one sample proportion test.

```{r}
# Get counts by late
counts <- late_shipments %>%
  count(late)

# See the results 
counts

# Inspect whether the counts are big enough
all(counts$n >= 10)
```

-   Using the `late_shipments` dataset, get counts by the `vendor_inco_term` and `freight_cost_group` columns.

-   Insert a suitable number to inspect whether the counts are "big enough" for a chi-square independence test.

```{r}
# Count the values of vendor_inco_term and freight_cost_group
counts <- late_shipments %>%
  count(vendor_inco_term, freight_cost_group)

# See the results 
counts

# Inspect whether the counts are big enough
all(counts$n >= 5)
```

-   Using the `late_shipments` dataset, get counts by the `shipment_mode` column.

-   Insert a suitable number to inspect whether the counts are "big enough" for an ANOVA test.

    ```{r}
    # Count the values of shipment_mode
    counts <- late_shipments %>%
      count(shipment_mode)

    # See the result
    counts

    # Inspect whether the counts are big enough
    all(counts$n >= 30)
    ```

### **The "There is only one test" framework**

Traditional hypothesis tests make assumptions about the sample dataset. Here, we'll look at what to do when those assumptions don't hold.

### **Imbalance data**

Sometimes you can't get a sample representing all parts of your population. Consider this subset of the Stack Overflow survey. There are fewer users with hobbyist "No" than "Yes", and fewer users with "At least 30" ages than "Under 30". In fact, there are no users with hobbyist "No" and age "At least 30". This is a problem for a proportion test, since it requires that every group has at least ten observations. I used `count(…, .drop)` argument to include subgroups with a zero count. **A dataset where some group's counts are much larger than others is called "imbalanced**".

```{r}
stack_overflow %>% 
  count(hobbyist, age_cat, .drop = FALSE)
```

### **Hypotheses**

We can declare hypotheses to test. I'll set a significance level of point-one.

$H_{0}$ : The proportion of hobbyists under 30 is the same as the proportion of hobbyists at least 30.

$H_{A}$ : The proportion of hobbists under 30 is different from the proportion of hobbyists at least 30

```{r}
alpha <- 0.1
```

### **Proceeding with a proportion test regardless**

Let's ignore the small sample problems and run a proportion test. The p-value is $2.403 \times 10^{-5}$ , which is below the significance level. This test suggests that there is enough evidence to reject the null hypothesis. We'll revisit this later.

```{r}
stack_overflow %>%
  prop_test(hobbyist ~ age_cat, order = c("At least 30", "Under 30"), success = "Yes",
            alternative = "two.sided", correct = FALSE)
```

### **A grammar of graphics**

Traditionally, every type of plot has a name, like scatter plot or line plot. In base-R you decide the plot type by specifying arguments to the plot function, or by calling a dedicated function like hist. This is fine until you want to draw a custom plot like a histogram with polar coordinates. If a function to draw that doesn't exist, you are stuck. The beauty of the grammar of graphics system that ggplot2 implements is that plots are broken down into components that can be combined in many ways. It's more flexible creating a new function for every plot type you can think of. What if there was a grammar of hypothesis tests?

| Plot type        | Base-R               | ggplot2                                    |
|------------------|------------------|------------------------------------|
| **Scatter plot** | `plot(, type = "p")` | `ggplot( ) + geom_point( )`                |
| **Line plot**    | `plot(, type = "l")` | `ggplot( ) + geom_line( )`                 |
| **Histogram**    | `hist( )`            | `ggplot( ) + geom_histogram( )`            |
| **Box plot**     | `boxplot( )`         | `ggplot( ) + geom_boxplot( )`              |
| **Bar plot**     | `barplot( )`         | `ggplot( ) + geom_bar( )`                  |
| **Pie plot**     | `pie( )`             | `ggplot( ) + geom_bar( ) + coord_polar( )` |

### **A grammar of hypothesis tests**

The good news is that there is a grammar of hypothesis tests. It was invented by DataCamp instructor Allen Downey, and is called the **"There is only one test"** framework. The framework was implemented in R in the `infer` package that you've used for proportion and chi-square tests. All hypothesis tests can be performed with this code flow, based on specify, hypothesize, generate, and calculate. By changing the arguments to those functions you can run standard tests like t-tests, or devise your own custom tests. We'll cover specify and hypothesize now, and the others in the next video. generate is worth mentioning briefly. It creates simulated data reflecting the null hypothesis. **Using simulation rather than arithmetic equations to get the test statistic is computationally expensive**, so it has only become accessible with modern computing power. However, simulation has a benefit that it works well even when you have small samples or imbalanced data.

^1^ Allen Downey teaches "Exploratory Data Analysis in Python".

```{r}
null_distn <- dataset %>%
  specify() %>%
  hypothesize() %>%
  generate() %>%
  calculate()
```

```{r}
obs_stat <- dataset %>%
  specify() %>%
  calculate()
```

```{r}
get_p_value(null_distn obs_stat)
```

### **Specifying the variables of interest**

The Stack Overflow survey contains many variables, but for this test, we only care about two of them. `specify()` works like dplyr's select, returning only the response and explanatory columns.

### **specify()**

To call specify, you pipe from the dataset, and pass a formula with the response on the left and the explanatory variable on the right. In the one sample case, use NULL for the explanatory value. Just as when you called `prop_test()`, you need to tell specify which response value is regarded as a success. specify returns a tibble with attributes noting the response and explanatory variable names.

-   For 2 sample tests, use: `response ~ explanatory`

-   For 1 sample tests use `response ~ NULL`

```{r}
stack_overflow %>%
  specify(hobbyist ~ age_cat, success = "Yes")
```

### **`hypothesize()`**

`hypothesize()` declares the type of null hypothesis. Proportion tests are a special case of the chi-square independence test, so we choose "independence". hypothesize doesn't calculate anything; it simply adds another attribute to the dataset.

-   For 2 sample tests, use `"independence"` or `"point"`

-   For 1 sample tests, use `"point"`

```{r}
stack_overflow %>%
  specify(hobbyist ~ age_cat, success = "Yes") %>%
  hypothesise(null = "independence")
```

## **Continuing the infer pipeline**

Previously we saw part of the infer pipeline for simulation-based hypothesis tests.

### **Recap: hypotheses and dataset**

We were testing the hypothesis that the proportion of hobbyists was the same for two age categories. However, this subset of the Stack Overflow survey was imbalanced, violating the assumptions of the traditional proportion test.

$H_{0}$ : The proportion of hobbiyists under 30 is the same as the prop´n of hobbyists at least 30.

$H_{A}$ : The proportion of hobbyists under 30 is different from the prop´n of hobbyists at least 30.

```{r}
alpha <- 0.1
```

### **Recap: workflow**

You saw the complete workflow for a custom hypothesis test. So far, we'd done the specify and hypothesize steps, resulting in a tibble with two columns and some extra attributes describing the response and explanatory variables, and the type of null hypothesis.

```{r}
null_distn <- dataset %>%
  specify() %>%
  hypothesize() %>%
  generate() %>%
  calculate()
```

```{r}
observed_stat <- dataset %>%
  specify() %>%
  calculate()
```

```{r}
get_p_value(null_distn, observed_stat)
```

### **Motivating `generate()`**

If we assume the null hypothesis is true, then it shouldn't matter which hobbyist value is matched up to each age category, since the proportion of hobbyists is the same in each age category. We can generate a simulated dataset by shuffling the response hobbyist values while keeping the explanatory age category values the same.

$H_{0}$ : The proportion of hobbiyists under 30 is the same as the prop´n of hobbyists at least 30.

if $H_{0}$ is true, then

-   In each row, the hobbist value could have appeared with either age category with equal probability.

-   To simulate this, we can permute (shuffle) the hobbyist values while keeping the age categories fixed

### **One permutation**

Conceptually, this is how generate shuffles the dataset. It selects the response column, randomly samples the whole column without replacement, then binds it back to the explanatory column. Compare the original dataset on the left, and the shuffled dataset on the right. The hobbyist values have changed but the age category values have not.

```{r}
bind_cols(stack_overflow %>%
            select(hobbyist) %>%
            slice_sample(prop = 1),
          stack_overflow %>%
            select(age_cat)
          )
```

### **Generating many replicates**

Since randomness is used in the shuffling step, we can't just create one simulated dataset, or the results would be unreliable. `generate()` performs the simulation step many times. Each simulated dataset is called a **replicate** and represents an example of what we might expect the two columns to look like in a universe where the null hypothesis is true.

### **`generate()`**

To call `generate()`, tell it how many replicates you want. For independence tests, the generation type should always be "permute". For convenience, generate combines all the simulated datasets into a single tibble. This is big! It has as many rows as the original dataset, times the number of replicates.

-   For "independence" null hypothesis, set `type` to `"permute"`.

-   For "point" null hypotheses, set `type` to `"bootstrap"` or `"simulate"`

```{r}
stack_overflow %>%
  specify(hobbyist ~ age_cat, success = "Yes") %>%
  hypothesize(null = "independence") %>%
  generate(reps = 5000, type = "permute")
```

### **Calculating the test statistic**

For each replicate, we calculate the test statistic, in this case the difference in proportions of hobbyists between the two age categories. 5000 replicates gives 5000z differences in proportions. We have a distribution of test statistics. This is known as the null distribution.

### **`calculate()`**

To use the difference in proportions as the test statistic, set the stat argument to "diff in props". We need to tell calculate which proportion to subtract from which by setting the order.

^1^ The ?calculate help page lists all possible test statistics.

`calculate()` calculates a distribution of test statistics known as the *null distribution*

```{r}
null_distn <- stack_overflow %>%
  specify(hobbyist ~ age_cat, success = "Yes") %>%
  hypothesize(null = "independence") %>%
  generate(reps = 5000, type = "permute") %>%
  calculate(stat = "diff in props", order = c("At least 30", "Under 30"))

null_distn
```

### **Visualizing the null distribution**

To visualize this null distribution as a histogram, call visualize. This histogram doesn't have a normal bell curve. That means the assumptions for the proportion test didn't hold. Also notice that the test statistic only takes nine distinct values.

```{r}
visualise(null_distn)
```

### **Calculating the test statistic on the original dataset**

To calculate a p-value, we need to compare the null distribution to the test statistic from the original dataset.

### **Observed statistic: `specify() %>% calculate()`**

To calculate the test statistic on the original sample dataset, you reuse the specify and calculate steps. Copy and paste the null distribution code, then remove the hypothesize and generate steps. Here, the observed statistic is -0.07.

```{r}
obs_stat <- stack_overflow %>%
  specify(hobbyist ~ age_cat, success = "Yes") %>%
  # hypotesize(null = "independence") %>%
  # generate(reps = 5000, type = "permute") %>%
  calculate(stat = "diff in props", order = c("At least 30", "Under 30"))

obs_stat
```

### **Visualizing the null distribution vs. the observed stat**

Here's the null distribution histogram with a vertical line added at the observed statistic. The observed statistic is at one edge of the distribution. Does this make it different enough from the null distribution that we should reject the null hypothesis? We'll need to calculate the p-value to find out.

```{r}
visualise(null_distn) +
  geom_vline(aes(xintercept = stat), data = obs_stat, color = "red")
```

### **Get the p-value**

To get the p-value, call `get_p_value()`, passing the null distribution and observed statistic. You also need to set the type of alternative hypothesis. **The argument is called direction rather than alternative**, and "two sided" is separated with a space rather than a dot. This time the p-value is point-one-five, which is greater than the significance level of point-one. That means that we should fail to reject the null hypothesis sticking with there being no difference in proportions of hobbyists between age categories. Recall the results from the dubious proportion test we used before. That had the opposite conclusion. Hopefully this illustrates the danger of using traditional hypothesis tests when the assumptions aren't met. Although it takes more code, the simulation-based hypothesis tests are more robust against small samples, and will help prevent you reaching poor conclusions.

```{r}
get_p_value(null_distn, obs_stat, direction = "two sided") # not alternative = "two.sided"
```

## **Non-parametric ANOVA and unpaired t-tests**

The simulation-based proportion test we performed using the infer pipeline avoided the assumption that the test statistic is normally distributed.

### **Non-parametric tests**

**Hypothesis tests that don't assume a probability distribution for the test statistic are called non-parametric tests**. There are two types of non-parametric test. The infer pipeline performs simulation-based tests. Other tests use the rank of the data.

Two types of non-parametric tests:

1.  Simulation-based

2.  Rank-based

### **`t_test()`**

Here's a t-test. This tested differences in mean annual compensation for Stack Overflow users depending on whether they first learned to program as a child or as an adult. Rather than calculating it step-by step, We've used t_test from infer. The degrees of freedom are different here because they are calculated using an equation called the **Welch approximation**, instead of taking the number of observations minus two. In practice, the difference isn't important. The p-value is 0.03082 instead of 0. Let's explore some non-parametric alternatives to the t-test, starting with the simulation-based infer pipeline.

$$
H_{0} : \mu_{child} \ - \mu_{adult} \ = 0 \ \ \  H_{A} : \mu_{child} \ - \mu_{adult} \ > 0
$$

```{r}
stack_overflow %>%
  t_test(converted_comp ~ age_first_code_cut, order = c("child", "adult"), alternative = "greater")

```

### **Calculating the null distribution**

Just as you did with the simulation-based proportion test, you start by specifying the variables of interest. It's the same formula passed to t_test: the numeric compensation on the left and the age categories on the right. Next you declare the null hypothesis, **that compensation is independent of the age category**. Thirdly, you **generate permutation replicates of the data**. Finally, you c**alculate the non-standardized test statistic for each replicate**. The stat argument is `"diff in means"`. The order argument is the same as in the call to t_test.

```{r}
# Simulation-based pipeline
null_distn <- stack_overflow %>%
  specify(converted_comp ~ age_first_code_cut) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 5000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("child", "adult")
  )

null_distn
```

```{r}
# t-test, for comparison 
stack_overflow %>%
  t_test(converted_comp ~ age_first_code_cut, order = c("child", "adult"), alternative = "greater")
```

### **Calculating the observed statistic**

To calculate the difference in means observed in the Stack Overflow survey, we reuse the specify and calculate steps from the previous pipeline.

```{r}
# Simulation-based pipeline
obs_stat <- stack_overflow %>%
  specify(converted_comp ~ age_first_code_cut) %>%
  calculate(stat = "diff in means", order = c("child", "adult"))

obs_stat
```

```{r}
# t-test for comparison
stack_overflow %>%
  t_test(converted_comp ~age_first_code_cut, order = c("child", "adult"), alternative = "greater")
```

### **Get the p-value**

Finally, we get the p-value from the null distribution and observed statistic. The direction argument to get_p_value is the same as the alternative argument in t_test. The p-value is 0.0362 rather than 0.0308

```{r}
# Simulation-based pipeline
get_p_value(null_distn, obs_stat, direction = "greater")
```

```{r}
# t-test for comparison
stack_overflow %>%
  t_test(converted_comp ~ age_first_code_cut, order = c("child", "adult"), alternative = "greater")
```

### **Ranks of vectors**

Consider this numeric vector, x. The first value of x, one, is the smallest. The second value, fifteen, is the fifth smallest. **These orderings from smallest to largest are known as the ranks of the elements of x**. You can access them with the rank function. You can avoid assumptions about normally distributed data by performing hypothesis tests on the ranks of a numeric input. ***The Wilcoxon-Mann-Whitney*** test is, very roughly speaking, a t-test on ranked data.

```{r}
x = c(1, 15, 3, 10, 6)
rank(x)
```

### **Wilcoxon-Mann-Whitney test**

You can run a ***Wilcoxon-Mann-Whitney*** test using `wilcox.test()` from base-R. **It accepts a formula and data argument**, though these are swapped compared to the infer functions, so they are less pipe-friendly. alternative sets the type of alternative hypothesis, and correct determines whether or not to apply a continuity correction to the z-scores. Since ranks are integers and the normal distribution is continuous, this correction can improve the approximation. However, for large sample sizes its effect is trivial and not needed. Here, the p-value is shown as less than two times ten to the minus sixteen, less than the significance level.

^1^ Also known as the "Wilcoxon rank-sum test" and the "Mann-Whitney U test".

```{r}
wilcox.test(converted_comp ~ age_first_code_cut, data = stack_overflow, alternative = "greater", correct = FALSE)
```

### **Kruskal-Wallis test**

In the same way that ANOVA extends t-tests to more than two groups, the ***Kruskal-Wallace*** test extends the ***Wilcoxon-Mann-Whitney*** to more than two groups. That is, the ***Kruskal-Wallace*** test is a non-parametric version of ANOVA. You pass a formula with the numerical variable on the left and the categorical variable on the right, and set data to the sample data frame. Again, the p-value here is very small.

```{r}
kruskal.test(converted_comp ~ job_sat, data = stack_overflow)
```
