Calculating P-values

We can calculate \(P\)-values in R by using cumulative distribution functions and inverse cumulative distribution functions (quantile function) of the known sampling distribution.

Cumulative distribution function (CDF)

If the probability density function is given by \(f(x)\), then the cumulative distribution function is given by \(F(x)\) and is defined by

\[ F(x) = \int_{-\infty}^{x} f(t)\, dt \] In other words, \(F(x)\) asks for the accumulated probability up to \(x\). Remembering calculus, this can be visualized as the area under the probability density curve.

For the \(\chi^2\) distribution, \(x\) is always non-negative, so the formula simplifies to

\[ F(x) = \int_{0}^{x} f(t)\, dt \] In the following figure, the left panel corresponds to the \(\chi^2\) probability density function \(f(x)\), while the right panel corresponds to the CDF \(F(x)\).

Complementary cumulative distribution function (CCDF)

The complementary cumulative distribution function \(G(x)\) corresponds to the right-tail of the probability density function, as can be seen from the following picture.

It can also be defined as follows:

\[ G(x) = \int_{x}^{\infty}f(t)\, dt = 1 - F(x). \]

How to calculate P-value

For our example from lecture (Feline High Rise Syndrome), we performed the following chi-squared test.

mydata <- read.csv("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter08/chap08q21FallingCatsByMonth.csv") %>%
  tbl_df()
(chi2 <- chisq.test(table(mydata), p = rep(1/12, 12)))

    Chi-squared test for given probabilities

data:  table(mydata)
X-squared = 20.664, df = 11, p-value = 0.03703
chi2 <- chi2$statistic

For our example from lecture (Feline High Rise Syndrome), we got the test statistic

\[\chi^2_{11} = 20.66\]

So how do we calculate the \(P\)-value? The definition of the \(P\)-value tells us that

\[ P\mathrm{-value} = \mathrm{Pr[}\chi^{2}_{11} \geq 20.66\mathrm{]}, \] as the \(P\)-value is the probability of getting your observed test statistic or worse in the null distribution. The formula above tells you that the \(P\)-value can be calculated by evaluating the CCDF of the \(\chi_{11}^{2}\) random variable! Here’s the corresponding plot of the area under the PDF that represents the \(P\)-value:

So what are the functions in R corresponding to the PDF, CDF and CCDF of a \(\chi^2\) random variable?

Name R command Uses
PDF dchisq(x, df) -
CDF pchisq(q, df, lower.tail=TRUE) -
CCDF pchisq(q, df, lower.tail=FALSE) Compute \(P\)-values

In these commands, d stands for density, while p stands for the cumulative probability distribution.

Exercise - Calculate P-value

Given this table, compute the \(P\)-value for a \(\chi_{11}^{2}\) test statistic of 20.66.

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiIjIyBDb21wdXRlIHRoZSBQLXZhbHVlIGZvciBhbiAxMSBkZWdyZWUtb2YtZnJlZWRvbSBcbiMgIGNoaS1zcXVhcmVkIHRlc3Qgc3RhdGlzdGljIG9mIDIwLjY2LiIsInNvbHV0aW9uIjoiIyMgQ29tcHV0ZSB0aGUgUC12YWx1ZSBmb3IgYW4gMTEgZGVncmVlLW9mLWZyZWVkb20gXG4jICBjaGktc3F1YXJlZCB0ZXN0IHN0YXRpc3RpYyBvZiAyMC42Ni5cbnBjaGlzcSgyMC42NiwgZGY9MTEsIGxvd2VyLnRhaWw9RkFMU0UpIiwic2N0IjoidGVzdF9vdXRwdXRfY29udGFpbnMoXCJwY2hpc3EoMjAuNjYsIGRmPTExLCBsb3dlci50YWlsPUZBTFNFKVwiLCBpbmNvcnJlY3RfbXNnID0gXCJPb3BzISAgVHJ5IGFnYWluLlwiKVxuc3VjY2Vzc19tc2coXCJTd2VldCFcIikifQ==

Calculating critical values

Calculating critical values is the reverse process of calculating \(P\)-values. To calculate \(P\)-values, we have a test statistics as input, and want the area under the curve (probability) as output. Now, for critical values, we start with area under the curve as input and want to know as output at what value of the test statistic do we achieve that given probability.

More specifically, the probability that we use as input is the significance level \(\alpha\) (in many cases 0.05). The critical value associated with that is denoted \(\chi^{2}_{df,\alpha}\) and is shown in the following figure.

In this case, the critical value is given by \[\chi^{2}_{11, 0.05} = 19.6751376\]

Here’s the corresponding CCDF.

Complementary quantile function (CQF)

By looking at the CCDF, what we notice is we are going from the vertical axis (probability) to the horizontal axis (test statistic). This is the inverse of the CCDF, which is called the complementary quantile function (CQF), and is given by the following graph.

We also have the quantile function (QF), which is defined as the inverse of the cumulative distribution function, i.e. \(Q = F^{-1}\). The quantile function can be used to calculate all quantiles of a distribution, including medians and quartiles. The 3rd panel in the following figure shows the quantile function. Make sure to compare its graph to the 2nd panel, which is the cumulative distribution function.

So, again, what are the functions in R corresponding to the CQF (and quantile function, QF) of a \(\chi^2\) random variable?

Name R command Uses
QF qchisq(p, df, lower.tail=TRUE) -
CQF qchisq(p, df, lower.tail=FALSE) Compute critical values

In these commands, p stands for probability (the significance level in this case) and q stands for quantile function.

Exercise - Calculate critical value

Given this table, compute the critical value \(\chi^{2}_{11, 0.05}\).

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiIjIyBDb21wdXRlIHRoZSBjcml0aWNhbCB2YWx1ZSBmb3IgYW4gMTEgZGVncmVlLW9mLWZyZWVkb20gXG4jICBjaGktc3F1YXJlZCBkaXN0cmlidXRpb24gd2l0aCBhbHBoYSA9IDAuMDUuIiwic29sdXRpb24iOiIjIyBDb21wdXRlIHRoZSBjcml0aWNhbCB2YWx1ZSBmb3IgYW4gMTEgZGVncmVlLW9mLWZyZWVkb20gXG4jICBjaGktc3F1YXJlZCBkaXN0cmlidXRpb24gd2l0aCBhbHBoYSA9IDAuMDUuXG5xY2hpc3EoMC4wNSwgZGY9MTEsIGxvd2VyLnRhaWw9RkFMU0UpIiwic2N0IjoidGVzdF9vdXRwdXRfY29udGFpbnMoXCJxY2hpc3EoMC4wNSwgZGY9MTEsIGxvd2VyLnRhaWw9RkFMU0UpXCIsIGluY29ycmVjdF9tc2cgPSBcIk9vcHMhICBUcnkgYWdhaW4uXCIpXG5zdWNjZXNzX21zZyhcIlRvdGFsbHkgdHVidWxhciFcIikifQ==

Here’s the graph associated with the previous calculation.

Exercise - Calculate quantiles

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiIjIyBDb21wdXRlIHRoZSBtZWRpYW4gb2YgYSBjaGktc3F1YXJlZCBkaXN0cmlidXRpb24gd2l0aCAxMSBkZWdyZWVzIG9mIGZyZWVkb21cblxuXG4jIyBDb21wdXRlIHRoZSAxc3QgcXVhcnRpbGUgb2YgYSBjaGktc3F1YXJlZCBkaXN0cmlidXRpb24gd2l0aCAxMSBkZWdyZWVzIG9mIGZyZWVkb21cblxuXG4jIyBDb21wdXRlIHRoZSAzcmQgcXVpbnRpbGUgb2YgYSBjaGktc3F1YXJlZCBkaXN0cmlidXRpb24gd2l0aCAxMSBkZWdyZWVzIG9mIGZyZWVkb20iLCJzb2x1dGlvbiI6IiMjIENvbXB1dGUgdGhlIG1lZGlhbiBvZiBhIGNoaS1zcXVhcmVkIGRpc3RyaWJ1dGlvbiB3aXRoIDExIGRlZ3JlZXMgb2YgZnJlZWRvbVxucWNoaXNxKDAuNSwgZGY9MTEsIGxvd2VyLnRhaWw9VFJVRSlcblxuIyMgQ29tcHV0ZSB0aGUgMXN0IHF1YXJ0aWxlIG9mIGEgY2hpLXNxdWFyZWQgZGlzdHJpYnV0aW9uIHdpdGggMTEgZGVncmVlcyBvZiBmcmVlZG9tXG5xY2hpc3EoMC4yNSwgZGY9MTEsIGxvd2VyLnRhaWw9VFJVRSlcblxuIyMgQ29tcHV0ZSB0aGUgM3JkIHF1aW50aWxlIG9mIGEgY2hpLXNxdWFyZWQgZGlzdHJpYnV0aW9uIHdpdGggMTEgZGVncmVlcyBvZiBmcmVlZG9tXG5xY2hpc3EoMC42LCBkZj0xMSwgbG93ZXIudGFpbD1UUlVFKSIsInNjdCI6InRlc3Rfb3V0cHV0X2NvbnRhaW5zKFwicWNoaXNxKDAuNSwgZGY9MTEsIGxvd2VyLnRhaWw9VFJVRSlcbiAgICAgICAgICAgICAgICAgICAgICBxY2hpc3EoMC4yNSwgZGY9MTEsIGxvd2VyLnRhaWw9VFJVRSlcbiAgICAgICAgICAgICAgICAgICAgICBxY2hpc3EoMC42LCBkZj0xMSwgbG93ZXIudGFpbD1UUlVFKVwiLCBpbmNvcnJlY3RfbXNnID0gXCJPb3BzISAgVHJ5IGFnYWluLlwiKVxuc3VjY2Vzc19tc2coXCJUb3RhbGx5IHR1YnVsYXIhXCIpIn0=

Summary of functions for chi-squared

Name R command Uses
PDF dchisq(x, df) Compute density
CDF pchisq(q, df, lower.tail=TRUE) Compute cumulative probability
CCDF pchisq(q, df, lower.tail=FALSE) Compute \(P\)-values
QF qchisq(p, df, lower.tail=TRUE) Compute quantiles
CQF qchisq(p, df, lower.tail=FALSE) Compute critical values

Chi-squared tests

Here’s a quick example of how to do a \(\chi^2\) goodness-of-fit test the quick way. This is based on Whitlock & Schluter, Chapter 8, Question #21.

mydata <- read.csv("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter08/chap08q21FallingCatsByMonth.csv")
str(mydata)
'data.frame':   119 obs. of  1 variable:
 $ month: Factor w/ 12 levels "April","August",..: 5 5 5 5 4 4 4 4 4 4 ...

Let’s construct the table.

(mytable <- table(mydata))
mydata
    April    August  December  February   January      July      June 
       10        13         5         6         4        19        14 
    March       May  November   October September 
        8         9         7        12        12 

Now perform the test using the chisq.test function.

chisq.test(mytable, p = rep(1/12, 12))

    Chi-squared test for given probabilities

data:  mytable
X-squared = 20.664, df = 11, p-value = 0.03703

The argument p needs to be a probability distribution (i.e. sum to 1). In other words, p is NOT the expected frequencies - p represents the expected relative frequencies.