To test the hypothesis :
\[H_{0} : \text{Probability of success } (p) = p^{*}\]
against its corresponding one-sided (\(H_{1} : p > p^{*}\) or \(H_{1} : p < p^{*}\)) or two-sided alternative (\(H_{1} : p \neq p^{*}\)), we make use of the function binom.test provided by the mosaic package.
The syntax for binom.test is as follows :
binom.test(
x, # Data to be tested
n = NULL, # Sample size
p, # Hypothesized probability of success (default value is 0.5)
alternative = c("two.sided","less","greater"), # Alternative hypothesis (default is "two.sided")
conf.level = 0.95, # Confidence level for the confidence interval (default value is 0.95)
# Method for computing the confidence interval
ci.method = c("Clopper-Pearson", "binom.test", "Score", "Wilson",
"prop.test", "Wald", "Agresti-Coull", "Plus4"),
data, # Dataframe from which the sample "x" is accessed
success # Level of the variable to be considered a success
)
We note several things regarding the input arguments for the binom.test function.
* The argument x can be given either as a single count of successes (together with n), as a data vector, or as a vector of length 2 giving the counts of successes and failures, e.g., c(175,25).
* The alternative argument may be abbreviated, e.g., alternative = "g" as opposed to alternative = "greater".
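For instance, a minimal sketch of the success/failure count form (the counts 175 and 25 are illustrative, and we assume mosaic's binom.test keeps the stats::binom.test convention for a length-2 count vector) :
# 175 successes and 25 failures out of 200 trials, testing p = 0.5
mosaic::binom.test(x = c(175, 25), p = 0.5)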
As an example, suppose there are 18 successes from 30 experiments, and we assume that the 2 products would be equally likely to be sold, so that the probability of success is 1/2. The test is
mosaic::binom.test(18,
n = 30,
p = 0.5,
alternative = "g")
data: 18 out of 30
number of successes = 18, number of trials = 30,
p-value = 0.1808
alternative hypothesis: true probability of success is greater than 0.5
95 percent confidence interval:
0.4339453 1.0000000
sample estimates:
probability of success
0.6
Since the p-value is \(0.1808 > 0.05 = \alpha\), at the \(\alpha\) level of significance we cannot conclude from the experiment that the new food item will sell better than the old one.
[Quantile testing]
Suppose we have data on the gasoline consumption (in Km/L) of a taxi company with two different types of taxis, one with radial tires and another with belted tires. We want to test : whether the median gas consumption of taxis with radial tires is greater than 4.5 Km/L, and whether the first quartile (Q1) of the gas consumption of taxis with belted tires is less than 5.5 Km/L.
# Read the data
taxi_data
# Median gas consumption of taxi w/ radial tires
rad_med = 4.5 # Hypothesized median
# Under the null hypothesis, the probability of the taxi's gas consumption values > 4.5 and < 4.5 are equal (w/ probability 0.5).
rad_test = taxi_data$radial_tires <= 4.5
# The hypothesis is equivalent to testing
# H0 : P(X<=4.5) >= 0.5
# H1 : P(X<=4.5) < 0.5
# Test the hypothesis
mosaic::binom.test(rad_test,
p = 0.5,
alternative = "l",
conf.level = 0.95)
data: rad_test [with success = TRUE]
number of successes = 2, number of trials = 16, p-value
= 0.00209
alternative hypothesis: true probability of success is less than 0.5
95 percent confidence interval:
0.0000000 0.3438252
sample estimates:
probability of success
0.125
Since the p-value is \(0.00209 < 0.05\), at the significance level of \(0.05\) we can conclude that the median gas consumption of taxis with radial tires is greater than 4.5 Km/L.
# Q1 of gas consumption of taxis w/ belted tires
belt_q1 = 5.5 # Hypothesized Q1
# if Y is the gas consumption of taxis with belted tires, the hypothesis is equivalent to
# H0 : P(Y <= 5.5) <= 0.25
# H1 : P(Y <= 5.5) > 0.25
# We assign the observations w/ values <= 5.5 as TRUE, and else, as FALSE
belt_test = taxi_data$belted_tires <= belt_q1
# Hypothesis test
mosaic::binom.test(belt_test,
p = 0.25,
alternative = "g",
conf.level = 0.95)
data: belt_test [with success = TRUE]
number of successes = 7, number of trials = 16, p-value
= 0.07956
alternative hypothesis: true probability of success is greater than 0.25
95 percent confidence interval:
0.2266916 1.0000000
sample estimates:
probability of success
0.4375
Since the p-value is \(0.07956 > 0.05\), we conclude that at the significance level of \(0.05\), the null hypothesis that the Q1 of the gas consumption of taxis w/ belted tires is \(\geq\) 5.5 Km/L is not rejected.
To perform a sign test, we use the function SignTest from the DescTools package.
The syntax for SignTest is as follows
SignTest(x,
y = NULL,
mu = 0,
alternative = c("two.sided","less","greater"),
conf.level = 0.95)
The input arguments are :
* x : a numeric vector of data values (or the first sample)
* y : an optional second sample, for a paired (dependent-samples) test
* mu : the hypothesized median (or median of the differences; default is 0)
* alternative : the alternative hypothesis (default is "two.sided")
* conf.level : the confidence level for the returned confidence interval (default is 0.95)
The function will return a list of class htest which contains the summary of the hypothesis test (the statistic value, p-value, etc.).
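For instance, the components of the returned object can be accessed directly; a small sketch (the argument values mirror example 1a below) :
res = SignTest(taxi_data$radial_tires, mu = 4.5, alternative = "greater")
res$statistic # the S statistic
res$p.value # the p-value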
From the taxi data, we will use the sign test to test (at significance level \(0.05\)) the hypotheses :
the median gas consumption of taxis with radial tires is equal to 4.5 Km/L against the hypothesis that it is greater than 4.5 Km/L
the median gas consumption of taxis with belted tires is equal to 5 Km/L against the hypothesis that it is not equal to 5 Km/L
if the tire type of the taxi affects the median gas consumption of the taxis. (Test for the median of the difference of gas consumption)
# 1a. Median gas consumption of taxis w/ radial tires
md_1a = 4.5 # Hypothesized median
# Test the hypothesis
SignTest(taxi_data$radial_tires,
mu = md_1a,
alternative = "g", # Alternative hypothesis is > md_1a
conf.level = 0.95)
One-sample Sign-Test
data: taxi_data$radial_tires
S = 14, number of differences = 15, p-value = 0.0004883
alternative hypothesis: true median is greater than 4.5
96.2 percent confidence interval:
4.9 Inf
sample estimates:
median of the differences
5.85
From the output, we can gather the following information :
* S : the statistic which denotes the # of positive differences between the data and the hypothesized median (how many data points are > hypothesized median)
* number of differences : how many data points differ from the hypothesized median (data points equal to the hypothesized median are removed)
* sample estimate of the median of the difference : the sample median which is the estimate for the distribution's median.
Since the p-value for the statistic is \(0.0004883 < 0.05\), we conclude at the significance level of \(0.05\) that the median gas consumption of taxis with radial tires is greater than 4.5 Km/L.
* Note that the p-value is the same as performing a binomial test for quantile testing the median, where the data point which is equal to the hypothesized median, is removed.
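We can verify this equivalence directly; a minimal sketch (dropping the observation equal to the hypothesized median, as noted above) :
# The sign test as a binomial test on the signs of the differences
rad = taxi_data$radial_tires[taxi_data$radial_tires != 4.5] # remove ties with the hypothesized median
binom.test(sum(rad > 4.5), n = length(rad), p = 0.5, alternative = "greater") # p-value = 0.0004883, matching the output above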
# 1b. Median of taxis with belted tires.
md_1b = 5 # Hypothesized median
# Test the hypothesis
SignTest(taxi_data$belted_tires,
mu = md_1b,
alternative = "t", # Alternative hypothesis is != 5
conf.level = 0.95
)
One-sample Sign-Test
data: taxi_data$belted_tires
S = 10, number of differences = 16, p-value = 0.4545
alternative hypothesis: true median is not equal to 5
97.9 percent confidence interval:
4.9 6.8
sample estimates:
median of the differences
5.75
Since the p-value for the statistic is \(0.4545 > 0.05\), we conclude that the null hypothesis cannot be rejected at \(0.05\) level of significance.
# 1c. median of the difference
md_1c = 0 # Hypothesized median of the difference
# Test the hypothesis
SignTest(x = taxi_data$radial_tires,
y = taxi_data$belted_tires,
mu = md_1c,
alternative = "t",
conf.level = 0.95
)
Dependent-samples Sign-Test
data: taxi_data$radial_tires and taxi_data$belted_tires
S = 11, number of differences = 14, p-value = 0.05737
alternative hypothesis: true median difference is not equal to 0
97.9 percent confidence interval:
0.0 0.4
sample estimates:
median of the differences
0.1
Since the p-value is \(0.05737 > 0.05\), we conclude that at the \(0.05\) level of significance, the null hypothesis that there is no significant difference in the median gas consumption of the 2 types of taxis cannot be rejected.
Another way to do the sign test on differences of 2 samples (2-sample sign test) is to first calculate the differences between the two samples. An example of this is
# 1c
# Calculates the difference of the 2 sample first
d = taxi_data$radial_tires - taxi_data$belted_tires
# Test the hypothesis
SignTest(d,
mu = md_1c,
alternative = "t",
conf.level = 0.95)
One-sample Sign-Test
data: d
S = 11, number of differences = 14, p-value = 0.05737
alternative hypothesis: true median is not equal to 0
97.9 percent confidence interval:
0.0 0.4
sample estimates:
median of the differences
0.1
Note that this gives the same result as before.
To calculate the Spearman’s rank correlation coefficient of 2 variables from a given sample, we use the function cor from the stats package. The package is usually already installed along with R.
The syntax for the cor function is as follows :
cor(x,y,
method = c("pearson","kendall","spearman"),
use = "everything")
The input arguments are :
* x, y : numeric vectors of data values
* method : which correlation coefficient to compute ("pearson", "kendall", or "spearman")
* use : how missing values are handled (default is "everything")
We can also test the hypothesis : \[
H_{0} : \text{The Spearman's Rank Correlation Coefficient } (\rho) = 0
\] against its one-sided or two-sided alternatives by using the function cor.test also from the stats package.
The syntax is similar to the cor function :
cor.test(x,y,
method = c("pearson","kendall","spearman"),
alternative = c("two.sided","less","greater"),
conf.level = 0.95)
where alternative is the alternative hypothesis, and conf.level is the confidence level for the confidence interval.
rating_price_data
We can plot the data to see visually whether there is any indication of correlation between the rating and the price.
library(ggplot2)
ggplot(rating_price_data) +
geom_point(mapping = aes(x=Rating,y=Price),color = '#2980B9')
It seems as if Rating and Price are not strongly correlated. We now calculate the Spearman rank correlation coefficient for Rating and Price.
# Spearman rank correlation coefficient
cor(rating_price_data$Rating,
rating_price_data$Price,
method="s")
[1] -0.4666667
There appears to be a moderate negative correlation; we can test the significance of this result using cor.test as follows
# Hypothesis testing for significance of correlation
cor.test(rating_price_data$Rating,
rating_price_data$Price,
method="s",
alternative = "t")
Spearman's rank correlation rho
data: x and y
S = 176, p-value = 0.2125
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
-0.4666667
We can see that the p-value for the statistic is \(0.2125 > 0.05\). Thus we conclude that at \(0.05\) level of significance, the hypothesis that rating and price are uncorrelated cannot be refuted.
As another example, using the cars dataset (available in R), we will compute the Spearman rank correlation coefficient between a car's speed and its stopping distance.
cars
# Spearman rank correlation coefficient
cor(cars$speed,cars$dist,method="s")
[1] 0.8303568
# Hypothesis testing
cor.test(cars$speed,
cars$dist,
method = "s",
alternative = "t")
Cannot compute exact p-value with ties
Spearman's rank correlation rho
data: x and y
S = 3532.8, p-value = 8.825e-14
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.8303568
Note that calculating the Spearman rank correlation coefficient using cor might be inaccurate, since it does not handle ties in the data very well.
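As a reminder of what cor is computing : Spearman's rho is the Pearson correlation of the ranks (with average ranks assigned to ties, which is where the caveat above comes from); a minimal sketch :
# Spearman's rho by hand : the Pearson correlation of the (average) ranks
cor(rank(cars$speed), rank(cars$dist)) # 0.8303568, the same value as method = "s" above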
To perform the Wilcoxon Signed Rank Test for testing the median for a single sample, or the median of difference in two-paired samples, we use the function wilcox.test from the stats package.
The syntax is as follows :
wilcox.test(x,
y = NULL,
alternative = c("two.sided","less","greater"),
mu = 0,
paired = FALSE,
exact = NULL,
conf.int = FALSE,
conf.level = 0.95)
The input arguments are :
* x : a numeric vector of data values (or the first sample)
* y : an optional second sample
* alternative : the alternative hypothesis
* mu : the hypothesized value of the median (or of the location shift)
* paired : whether the two samples are paired
* exact : whether the exact p-value should be computed (if NULL, the function decides based on the sample size)
* conf.int : whether a confidence interval should be computed
* conf.level : the confidence level for the confidence interval
[Single sample median testing]
We will test using the taxi data, the same hypothesis as in the sign test, but using Wilcoxon Signed Rank Test.
the median gas consumption of taxis with radial tires is equal to 4.5 Km/L against the hypothesis that it is greater than 4.5 Km/L
the median gas consumption of taxis with belted tires is equal to 5 Km/L against the hypothesis that it is not equal to 5 Km/L
# 1a. Median of taxis with radial tires
w1a = wilcox.test(taxi_data$radial_tires,
mu = 4.5,
alternative = "g",
exact = TRUE) # Calculate the p-value using the exact distribution if possible
cannot compute exact p-value with ties
cannot compute exact p-value with zeroes
w1a
Wilcoxon signed rank test with continuity correction
data: taxi_data$radial_tires
V = 118, p-value = 0.0005433
alternative hypothesis: true location is greater than 4.5
The test gives the Wilcoxon statistic, shown as ```V``` in the output, and the corresponding p-value. Note that the function approximates the p-value, since there are ties in the data. Also note that the computed Wilcoxon statistic (```V```) is the sum of the ranks of the observations with positive differences with respect to the hypothesized median. Thus, if we want to consult a Wilcoxon statistic table, we might have to calculate the Wilcoxon statistic corresponding to the negative differences, which we can compute manually as follows
x = taxi_data$radial_tires
n = length(x[x != 4.5]) # Sample size of data omitting the observation that is equal to the hypothesized median
w_neg = n*(n+1)/2 - w1a$statistic # The sum of ranks of observation w/ negative differences
w_neg
V
2
# 1b. Median of the taxi with belted tires
w1b = wilcox.test(taxi_data$belted_tires,
mu = 5,
alternative = "t",
exact = TRUE)
cannot compute exact p-value with ties
w1b
Wilcoxon signed rank test with continuity correction
data: taxi_data$belted_tires
V = 111, p-value = 0.02785
alternative hypothesis: true location is not equal to 5
We see that the p-value for this test statistic is (approximately) \(0.02785 < 0.05\), thus at \(0.05\) level of significance, we can conclude that the median gas consumption of taxis with belted tires is not equal to 5 Km/L. We can also crosscheck this result by comparing the test statistic with the critical value from a Wilcoxon signed-rank statistic table.
# Calculating the negative difference Wilcoxon statistic
y = taxi_data$belted_tires
ny = length(y[y!=5]) # Sample size of data without the observation equalling the hypothesized median
wy_neg = ny*(ny+1)/2 - w1b$statistic # The negative difference Wilcoxon statistic
wy_neg
V
25
[Paired-sample tests]
Consider the following data regarding joggers’ systolic blood pressure before and after an 8-kilometer run. We will test the hypothesis that the median systolic blood pressure increases by 8 points against the alternative that the increase is less than 8 points
jogger_data
# Hypothesis test
wilcox.test(jogger_data$After,
jogger_data$Before,
mu = 8, # Hypothesized median difference (After - Before)
paired = T, # Because we are dealing with paired data
alternative = "l"
)
cannot compute exact p-value with ties
cannot compute exact p-value with zeroes
Wilcoxon signed rank test with continuity correction
data: jogger_data$After and jogger_data$Before
V = 15, p-value = 0.01765
alternative hypothesis: true location shift is less than 8
Since the p-value is \(0.01765 < 0.05\), we conclude at the \(0.05\) level of significance that the increase in median systolic blood pressure is less than 8 points.
To perform the Wilcoxon rank sum test on two-sample (unpaired) data, we can use the wilcox.test function as before, or use other functions such as wilcox_test from the coin package. The only difference is the value of the statistic that is presented : using wilcox.test will produce the U statistic, while from wilcox_test we can get the W (rank sum) statistic.
library(coin) # Load the coin package (install it first)
Loading required package: survival
For using the wilcox.test function, the syntax is :
wilcox.test(x,
y,
mu = 0,
paired = FALSE,
alternative = c("two.sided","less","greater"),
exact = NULL,
correct = TRUE,
...)
The input arguments are the same as before, although to conduct a Wilcoxon Rank-Sum test, make sure that paired is set to FALSE (which is the default).
This will test the null hypothesis of \[ H_{0} : \widetilde{\mu}_x - \widetilde{\mu}_y = 0 \] against the corresponding alternative.
For the wilcox_test function, the syntax is :
wilcox_test(formula,
data,
subset = NULL,
ties.method = c("mid-ranks","average-scores"),
distribution = c("asymptotic","approximate","exact"),
...)
The input arguments are :
* formula : a formula of the form values ~ groups
* data : the dataframe containing the variables in the formula
* subset : an optional vector specifying a subset of the data to be used
* ties.method : the method for handling ties ("mid-ranks" or "average-scores")
* distribution : how the null distribution of the statistic is computed ("asymptotic", "approximate", or "exact")
The formula, data, and subset arguments can also be used with the wilcox.test function.
Note that to use this function, the data needs to be in a certain shape : one column of numeric values, and one column of group labels specifying from which group/level each data point comes.
So, in order to run the tests on our data, we need to reshape the data as follows.
# Suppose the original data is
car_rating_data[,1:2]
To change the shape of the data, we do the following procedure
D1 = na.omit(stack(car_rating_data[,1:2])) # stack() (base R) stacks the numeric values into one column, with the source column names as a factor; na.omit() drops any NA data points
colnames(D1) = c("values","groups") # Gives names for the columns of the new dataframe
# The result is
D1
Now we can apply the wilcox.test and wilcox_test function on the data.
# Using wilcox.test
wilcox.test(values~groups, # The formula
data = D1, # The dataframe to reference the data
alternative = "t"
)
cannot compute exact p-value with ties
Wilcoxon rank sum test with continuity correction
data: values by groups
W = 1.5, p-value = 0.008367
alternative hypothesis: true location shift is not equal to 0
# The null hypothesis is that the difference between the median of group A and the median of group B is 0
From the output, we can gather that the Mann-Whitney U statistic corresponding to the base level group (A) is W = 1.5 and the (approximate) p-value for this statistic value is \(0.008367 < 0.05\). So, we conclude that at \(0.05\) level of significance, there is a significant difference across the two groups.
# Using wilcox_test
wrs = wilcox_test(values~groups, # The formula
data = D1, # The dataframe
alternative = "t",
distribution = "e", # Calculate the p-value using the exact distribution
ties.method = "a" # Using average scores for ties
)
wrs
Exact Wilcoxon-Mann-Whitney Test
data: values by groups (A, B)
Z = -2.7193, p-value = 0.008658
alternative hypothesis: true mu is not equal to 0
We can see that the p-value differs from the one produced by the wilcox.test function; this is because the calculations of the p-values differ (for more information, refer to the documentation). Nevertheless, at the \(0.05\) level of significance, the null hypothesis is still rejected. Note also that the statistic presented in the output of wilcox_test is the "normalized" statistic; we can get the Wilcoxon rank sum statistic from this test by calling it specifically, for example
# The wilcoxon rank-sum statistic
w_statistic = statistic(wrs,"linear") # The wilcoxon rank sum statistic
w_statistic
A 22.5
Note that the statistic given corresponds to the base level of the groups (that is, A); to get the statistic corresponding to the other group, we can relevel the group factor using the function relevel from the stats package.
# Relevel the factors
D1$groups = relevel(D1$groups, ref = "B") # Changes the base level of the groups to "B"
# wilcox.test
wilcox.test(values~groups,data = D1, alternative = "t")
cannot compute exact p-value with ties
Wilcoxon rank sum test with continuity correction
data: values by groups
W = 34.5, p-value = 0.008367
alternative hypothesis: true location shift is not equal to 0
# wilcox_test
wrs_b = wilcox_test(values~groups,data = D1, distribution = "e", ties.method = "a")
w_statistic_b = statistic(wrs_b,"linear")
w_statistic_b
B 55.5
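The two statistics are related by the identity \(U = W - n_{1}(n_{1}+1)/2\), where \(n_{1}\) is the size of the group the statistic refers to (here each group has 6 observations, consistent with the outputs above); a quick arithmetic check :
# Recovering the U statistic of wilcox.test from the rank sum W of wilcox_test
22.5 - 6 * (6 + 1) / 2 # = 1.5, the statistic reported by wilcox.test for base level A
55.5 - 6 * (6 + 1) / 2 # = 34.5, the statistic reported after releveling to B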
cigarette_data
We will test whether there is a difference in median nicotine content between the 2 brands of cigarettes.
# Reshaping the data
cig_data = na.omit(stack(cigarette_data))
colnames(cig_data) = c("nicotine_content","brand") # Renaming the columns
cig_data
# Hypothesis test
wilcox_test(nicotine_content~brand,
data = cig_data,
alternative = "t", # Alternative hypothesis
distribution = "e", # Use the exact distribution
ties.method = "a" # Use average scores for ties
)
Exact Wilcoxon-Mann-Whitney Test
data: nicotine_content by brand (A, B)
Z = 1.5121, p-value = 0.1391
alternative hypothesis: true mu is not equal to 0
Since the p-value is \(0.1391 > 0.05\), we conclude that the null hypothesis that there is no significant difference in median nicotine content between the 2 brands of cigarettes is not rejected at \(0.05\) level of significance.
To perform the Kruskal-Wallis Test, we make use of the function kruskal.test from the stats package.
The syntax is as follows :
kruskal.test(x,g)
or
kruskal.test(formula,
data)
Where the input arguments are :
* x : a numeric vector of data values
* g : a factor giving the group membership of each element of x
or, for the formula interface :
* formula : a formula of the form values ~ groups
* data : the dataframe containing the variables in the formula
car_rating_data
We will test if there are any differences in the median rating given by the experts for the 3 brands of cars.
# Reshape the data
car_data = na.omit(stack(car_rating_data))
colnames(car_data) = c("rating","brand") # Renaming the columns of the reshaped data
car_data
# Hypothesis testing
kruskal.test(rating~brand,
data = car_data)
Kruskal-Wallis rank sum test
data: rating by brand
Kruskal-Wallis chi-squared = 9.3326, df = 2, p-value =
0.009407
We can see that the (approximate) chi-squared statistic is 9.3326 and the corresponding p-value is \(0.009407 < 0.05\). So we conclude that at \(0.05\) level of significance, the null hypothesis that there are no differences in the median rating given by the experts is rejected.
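Equivalently, the default (non-formula) interface of kruskal.test gives the same result :
# Same test using the x and g arguments directly
kruskal.test(x = car_data$rating, g = car_data$brand)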
Before doing the Kruskal-Wallis test, it might be of interest to plot the data to see visually whether there are differences between the groups. We can do this as follows
# Boxplot of ratings for each groups
library(ggpubr)
ggboxplot(data = car_data,
x = "brand",
y = "rating")
To perform the runs test, we use the function RunsTest from the DescTools package.
library(DescTools) # Load the DescTools package (install it first)
The syntax is as follows :
RunsTest(x,
alternative = c("t","l","g"),
exact = NULL,
correct = TRUE,
na.rm = FALSE)
The input arguments are :
* x : a vector of data values (dichotomous, or numeric to be dichotomized)
* alternative : the alternative hypothesis
* exact : whether the exact p-value should be computed
* correct : whether a continuity correction is applied for the normal approximation
* na.rm : whether missing values should be removed
Note that if x is a numeric vector with more than 2 values, by default it will be transformed into a dichotomous vector (a logical vector of TRUE and FALSE) by comparing each value against the sample median. If another threshold is to be chosen, say the mean of the sample, we can use RunsTest(x > mean(x)).
As a first example, we will test whether the following sequence of two symbols is random.
# The data vector
x1 = c("A","B","A","A","B","B","B","B","A","B","B","A","A","A","B","A","B")
# Runs Test
RunsTest(x1)
Runs Test for Randomness
data: x1
runs = 10, m = 8, n = 9, p-value = 0.8186
alternative hypothesis: true number of runs is not equal the expected number
Since the p-value is \(0.8186 > 0.05\), we conclude that at \(0.05\) level of significance, the null hypothesis that the sequence is random is not rejected.
# Data vector (amount of paint thinner)
x2 = c(3.6,3.9,4.1,3.6,3.8,3.7,3.4,4.0,3.8,4.1,3.9,4.0,3.8,4.2,4.1)
# Runs Test (with the median as the threshold)
x2_test = x2[x2 != median(x2)] # omit any data point w/ value equal to the sample median
RunsTest(x2_test > median(x2), exact = TRUE)
Runs Test for Randomness
data: x2_test > median(x2)
runs = 8, m = 7, n = 6, p-value = 0.796
alternative hypothesis: true number of runs is not equal the expected number
Since the computed p-value is \(0.796 > 0.05\), we conclude that at \(0.05\) level of significance, the null hypothesis that the process is random is not rejected.
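If another threshold were preferred, say the sample mean as mentioned earlier, the call would be, for example :
# Runs test dichotomizing at the sample mean instead of the median
RunsTest(x2 > mean(x2))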
Given a sample of continuous data, we can calculate the tolerance interval/tolerance limit of the data using the function nptol.int from the tolerance package.
library(tolerance) # Load the package (install it first)
The syntax for the nptol.int function is as follows :
nptol.int(x,
alpha = 0.05,
P = 0.99,
side = 1,
method = c("WILKS","WALD","HM","YM"),
upper = NULL,
lower = NULL)
The input arguments are :
* x : numeric vector of the data
* alpha : the significance level of the tolerance limit (default is 0.05)
* P : the proportion of the data to be covered (default is 0.99)
* side : whether to calculate the 2-sided or 1-sided tolerance limit
* method : the method that will be used to calculate the tolerance limit
* upper : the upper bound of the data (if NULL, then max(x) is used)
* lower : the lower bound of the data (if NULL, then min(x) is used)
As an example, we will calculate a one-sided tolerance limit for the paint-thinner data x2 used above.
nptol.int(x2,
alpha = 1-0.90, # The significance level is 1 - 90%
P = 0.75, # Cover 75% of the data
side = 1)
We see that the upper tolerance limit is 4.1.
To perform the Kolmogorov-Smirnov Test of equal distribution, we can use the function ks.test from the stats package.
The syntax for ks.test is as follows
ks.test(x,
y, ...,
alternative = c("two.sided","less","greater"),
exact = NULL
)
The input arguments are :
* x : a numeric vector of data values
* y : either a second numeric vector of data values, or a character string naming a cumulative distribution function (such as "pnorm")
* ... : parameters of the distribution specified by y
* alternative : the alternative hypothesis
* exact : whether an exact p-value should be computed
An important thing to note here is that if we want to compare the cdf of x to a known continuous cdf y, then the parameters of the cdf of y need to be known beforehand. In the case that the parameters of the null distribution (the cdf of y) are unknown and need to be estimated using sample statistics, we can use the function LcKS from the KScorrect package.
Suppose we have a sample x and we want to test if the sample comes from a normally distributed population. For this example, we will use the cars data, available in R.
# The data
x = cars$dist
# Quantile plot of x
qqnorm(x)
# We will test the distribution of x, against the normal distribution
ks.test(x,
"pnorm")
ties should not be present for the Kolmogorov-Smirnov test
One-sample Kolmogorov-Smirnov test
data: x
D = 0.97997, p-value < 2.2e-16
alternative hypothesis: two-sided
We see that if we only input "pnorm" as the compared distribution, ks.test gives a considerably small p-value, and thus the null hypothesis that x is "normally" distributed is rejected (even though graphically, from the quantile plot, we might not reject that hypothesis). Keep in mind that by only giving the input "pnorm", the compared distribution is the standard normal distribution (\(\mathcal{N} (0,1)\)). Thus, if we compare a sample from a \(\mathcal{N} (100,1)\) distribution to the standard normal, ks.test would give a p-value small enough to reject the null hypothesis that the sample is normally distributed with mean 0 and variance 1. To properly test if x is normally distributed, we need to know the parameters of the population (that is, the mean and variance of the distribution), which in most cases are not known, and thus we need to use approximations, such as the sample mean and sample variance. For example :
# Testing if x is normally distributed, w/ parameters approximated with the mean and standard deviation of x
ks.test(x,
"pnorm",mean=mean(x),sd = sd(x), # We use sd(x) since that is the parameter that the functions in R use to parameterize the normal distribution
)
ties should not be present for the Kolmogorov-Smirnov test
One-sample Kolmogorov-Smirnov test
data: x
D = 0.12675, p-value = 0.3979
alternative hypothesis: two-sided
We see that now, ks.test gives the p-value \(0.3979 > 0.05\), and thus the null hypothesis that x is normally distributed with mean equal to the sample mean and variance equal to the sample variance, is not rejected at \(0.05\) level of significance.
# The samples
x = rnorm(100,mean = 25, sd = 2) # A normally distributed sample w/ mean = 25 and standard dev = 2
y = rexp(200,rate = 2) # An exponentially distributed sample w/ rate = 2
# Test if x and y are from the same distribution
ks.test(x,y)
Two-sample Kolmogorov-Smirnov test
data: x and y
D = 1, p-value < 2.2e-16
alternative hypothesis: two-sided
We see that the p-value is small enough that we may reject the null hypothesis that x and y came from the same distribution, even at a very small significance level (even at \(0.00000001\), the null hypothesis can still be rejected). Thus, we conclude that x and y do not come from the same distribution.
The Lilliefors test for normality provides a correction to the Kolmogorov-Smirnov test when the parameters of the population are not known.
To perform the Lilliefors test for normality, we use the function lillie.test from the nortest package
library(nortest) # Load the library (install it first)
The syntax is as follows :
lillie.test(x)
The input argument is :
* x : a numeric vector of data values
Note that the statistic produced by this function will be equal to that of ks.test(x, "pnorm", mean = mean(x), sd = sd(x)), but the p-value will be different.
Using the cars data, we will test if the distance traveled by the cars is normally distributed.
# The data
x = cars$dist
# Test
lillie.test(x)
Lilliefors (Kolmogorov-Smirnov) normality test
data: x
D = 0.12675, p-value = 0.04335
Note that the statistic value is the same, D = 0.12675, but the p-value is different. In fact, at the \(0.05\) level of significance the conclusion changes : the Lilliefors test rejects the null hypothesis that x is normally distributed (since \(0.04335 < 0.05\)), whereas the uncorrected test with estimated parameters did not.
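We can verify that the statistics agree, since both functions return htest objects; a minimal sketch :
# The D statistics are identical; only the p-values differ
ks.test(x, "pnorm", mean = mean(x), sd = sd(x))$statistic # D = 0.12675
lillie.test(x)$statistic # D = 0.12675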
For other distributions, the function LcKS from the KScorrect package provides the Lilliefors-corrected Kolmogorov-Smirnov test, for distributions such as the exponential, uniform, mixture normal, and others. For more information, refer to the documentation of the function.
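As a minimal sketch of how such a call might look (we assume the LcKS(x, cdf, nreps) interface here; consult the package documentation for the exact arguments) :
library(KScorrect) # Load the package (install it first)
# Lilliefors-corrected KS test of x against an exponential distribution with estimated rate
res_lcks = LcKS(x, cdf = "pexp", nreps = 999) # the null distribution is simulated by Monte Carlo
res_lcks$p.value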
To calculate the Kendall’s Tau statistic between 2 vectors of data, we use the function cor from the stats package.
Given 2 numeric data vectors x,y, to calculate the Kendall’s Tau between the 2 data, we set the argument method = "k" or method = "kendall" in the cor function.
As an example, we will calculate the Kendall's Tau between Rating and Price from the rating_price_data data.
rating_price_data
# Kendall's Tau statistic
cor(rating_price_data$Rating,
rating_price_data$Price,
method = "k",
)
[1] -0.3333333
We see that the Kendall’s Tau is -0.3333333. To test the significance of this, we use the cor.test function
# significance of kendall's tau correlation
cor.test(rating_price_data$Rating,
rating_price_data$Price,
method = "k")
Kendall's rank correlation tau
data: x and y
T = 12, p-value = 0.2595
alternative hypothesis: true tau is not equal to 0
sample estimates:
tau
-0.3333333
Since the p-value is \(0.2595 > 0.05\), we conclude that at the \(0.05\) level of significance, the null hypothesis that the Kendall's Tau between the rating and price is 0 (i.e., there is no correlation between rating and price according to Kendall's Tau statistic) is not rejected.
Given a contingency table, we can calculate the Cramer coefficient (Cramer’s V) between the categories in the rows and the categories in the columns by using the cramer function from the sjstats package.
The syntax for the cramer function is as follows :
cramer(tab)
The input argument is :
* tab : a contingency table (for example, an object produced by the table function)
If the data at hand is not already a contingency table, we can still calculate the Cramer's V between 2 variables using the function crosstable_statistics, also from the sjstats package. The syntax for this function is as follows :
crosstable_statistics(data,
x1,
x2,
statistics = c("auto","cramer","phi","spearman","kendall","pearson","fisher"),
...)
The input arguments are :
* data : the dataframe containing the variables
* x1, x2 : the names of the two variables in data to be analyzed
* statistics : the statistic/measure of association to compute
As an example, we will use the efc (Euro Family Care) data, a dataset regarding the caretakers of elderly people in Europe.
data(efc)
head(efc) # First 6 rows of the data
We will focus on 2 variables, e16sex (Gender of the elder) and c161sex (Gender of the caretaker). We will see if there is any association/correlation between the gender of the caretaker and the gender of the elder.
We can first generate the cross tabulation (contingency table) between the 2 variables by using the CrossTable function from the descr package.
# Load the required package (Install it first)
library(descr)
# Making the crosstabulation
CrossTable(x = efc$e16sex,
y = efc$c161sex)
Cell Contents
|-------------------------|
| N |
| Chi-square contribution |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
===================================
efc$c161sex
efc$e16sex 1 2 Total
-----------------------------------
1 61 234 295
1.273 0.400
0.207 0.793 0.328
0.284 0.342
0.068 0.260
-----------------------------------
2 154 451 605
0.621 0.195
0.255 0.745 0.672
0.716 0.658
0.171 0.501
-----------------------------------
Total 215 685 900
0.239 0.761
===================================
We can see the marginal frequencies for each gender of the caretaker and the gender of the elder.
# Calculate the cramer's V statistic using crosstable_statistics
crosstable_statistics(data = efc, # Dataframe
x1 = e16sex,
x2 = c161sex,
statistics = "c", # Calculate the cramer's V statistic
)
# Measure of Association for Contingency Tables
Chi-squared: 2.2327
Cramer's V: 0.0526
p-value: 0.1351
We see that the Cramer’s V is \(0.0526\), which is quite close to \(0\) and the corresponding p-value is \(0.1351 > 0.05\). So, we conclude that at \(0.05\) level of significance, there is no association/correlation between the gender of the caretaker and the gender of the elder.
Another way to do it is to first make a contingency table of the 2 variables, and then apply the cramer function to it.
# Creating the contingency table for the 2 variables
cont_table = table(efc$e16sex,efc$c161sex)
cont_table
1 2
1 61 234
2 154 451
We can calculate the Cramer’s V using the cramer function
cramer(cont_table)
[1] 0.05258249
We get that the Cramer’s V for the 2 variables is \(0.05258249\).
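As a cross-check, Cramer's V can be computed by hand from the (uncorrected) chi-squared statistic via \(V = \sqrt{\chi^{2} / (n \cdot (\min(r,c)-1))}\); a minimal sketch :
# Cramer's V from the uncorrected chi-squared statistic
chi2 = chisq.test(cont_table, correct = FALSE)$statistic
n = sum(cont_table)
k = min(nrow(cont_table), ncol(cont_table))
sqrt(chi2 / (n * (k - 1))) # approximately 0.0526, matching cramer(cont_table)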
We can calculate the Phi coefficient between 2 binary categorical variables by using the same crosstable_statistics function from the sjstats package as we use in calculating Cramer coefficient. To calculate the Phi coefficient of 2 variables, we set the statistics argument in the crosstable_statistics to "phi".
An analog to the cramer function can also be used to calculate the phi coefficient given a 2x2 contingency table, that is the phi function from sjstats package.
# Calculate the Phi Coefficient
crosstable_statistics(efc,
x1 = e16sex,
x2 = c161sex,
statistics = "phi")
# Measure of Association for Contingency Tables
Chi-squared: 2.2327
Phi: 0.0526
p-value: 0.1351
We can see that the Phi coefficient is \(0.0526\) and the corresponding p-value is \(0.1351 > 0.05\). Thus we conclude that the null hypothesis that there is no association/correlation/dependence between the gender of the caretaker and the elder is not rejected at \(0.05\) level of significance.
We can also use the phi function, by first creating the contingency table
# Creating the contingency table
cont_table_2 = table(efc$e16sex,efc$c161sex)
cont_table_2
1 2
1 61 234
2 154 451
Now we can calculate the Phi coefficient
phi(cont_table_2)
[1] 0.05258249
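For a 2x2 table, the Phi coefficient coincides with Cramer's V and can likewise be computed from the uncorrected chi-squared statistic as \(\phi = \sqrt{\chi^{2}/n}\); a quick sketch :
# Phi for a 2x2 table from the uncorrected chi-squared statistic
chi2_phi = chisq.test(cont_table_2, correct = FALSE)$statistic
sqrt(chi2_phi / sum(cont_table_2)) # approximately 0.0526, matching phi(cont_table_2)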
To perform the Cochran’s Test, we use the function cochran.qtest from the RVAideMemoire package.
The syntax for the cochran.qtest is as follows :
cochran.qtest(formula,
data,
alpha = 0.05)
The input arguments are :
* formula : a formula of the form response ~ group | block
* data : the dataframe containing the variables in the formula
* alpha : the significance level (default is 0.05)
breakfast
We would like to test if there are any beverage which is more preferred than the others. To do this, we conduct a Cochran’s Test.
Here the response vector (y) is response, the groups are beverage and the block is people.
# Change the data type of beverage and response to factor
breakfast$beverage = as.factor(breakfast$beverage)
breakfast$response = as.factor(breakfast$response)
breakfast
# Cochran test
cochran.qtest(response~beverage | people,
data = breakfast,
alpha = 0.05)
Cochran's Q test
data: response by beverage, block = people
Q = 3.3333, df = 2, p-value = 0.1889
alternative hypothesis: true difference in probabilities is not equal to 0
sample estimates:
proba in group coffee proba in group juice
0.6470588 0.3529412
proba in group tea
0.3529412
We see that the statistic value is \(3.3333\) and the corresponding p-value is \(0.1889 > 0.05\). Thus we conclude that at \(0.05\) level of significance, the null hypothesis that there is no difference in people’s preference of beverage is not rejected.
(from : https://rcompanion.org/handbook/H_07.html)
students
We want to test if there are any activities which are more appealing for the students to do. To test this, we will conduct the Cochran’s Test, with Practice as groups, Student as the blocks, and Response as the response vector
# Changing the response vector and the groups as factor
students$Practice = as.factor(students$Practice)
students$Response = as.factor(students$Response)
# Cochran's Test
cochran.qtest(Response ~ Practice | Student,
data = students,
alpha = 0.05)
Cochran's Q test
data: Response by Practice, block = Student
Q = 16.9535, df = 3, p-value = 0.0007225
alternative hypothesis: true difference in probabilities is not equal to 0
sample estimates:
proba in group Clippings proba in group Irrigation
0.5714286 0.1428571
proba in group MowHeight proba in group SoilTest
0.8571429 0.7857143
Pairwise comparisons using Wilcoxon sign test
Clippings Irrigation MowHeight
Irrigation 0.0625 - -
MowHeight 0.3281 0.03516 -
SoilTest 0.5438 0.03516 1
P value adjustment method: fdr
Since the p-value is \(0.0007225 < 0.05\), we conclude that there is a difference in students’ preferred activities (Some activities have more probability that the students will be willing to do than others).
Note that the function also produces pairwise comparisons between the groups using pairwise Wilcoxon sign tests if the null hypothesis of the Cochran's Test is rejected at the alpha level. Another alternative (that is more frequently done) is to do pairwise McNemar tests on each pair of groups, to test which group has a higher probability compared to the other.
To do pairwise McNemar Tests, we make use of the function pairwiseMcnemar from the rcompanion package
The syntax for the function pairwiseMcnemar is as follows
pairwiseMcnemar(formula,
data,
x,
g,
block,
test,
...
)
The input arguments are :
* formula, data : a formula and dataframe of the same form as for cochran.qtest
* x, g, block : alternatively, x is the response vector, g is the group, and block is the block
* test : one of "exact", "mcnemar", or "permutation". "exact" denotes an exact test of symmetry analogous to the McNemar test, "mcnemar" denotes the McNemar test of symmetry, and "permutation" denotes a permutation test analogous to the McNemar test.
The input data should be given either as formula and data, or as x, g, and block.
# Pairwise McNemar test on the students data
pair_mcnemar = pairwiseMcnemar(Response ~ Practice | Student,
data = students,
test = "mcnemar")
pair_mcnemar
$Test.method
$Adjustment.method
$Pairwise
NA
We can gather the p-value for each comparison. At \(0.05\) level of significance, we can see that the probability that the students prefer to do the activity “Irrigation” is significantly different compared to the activity “Clippings”, “MowHeight” and “SoilTest”.
To do Theil-Sen regression, we make use of the function mblm from the mblm package
The syntax for the mblm function is
mblm(formula,
dataframe,
repeated = TRUE)
The input arguments are :
* formula : a formula of the form y ~ x (a single regressor)
* dataframe : the dataframe containing the variables in the formula
* repeated : if TRUE, Siegel's repeated medians method is used; if FALSE, the simple Theil-Sen estimator is used
Similar to the lm function, the mblm function returns a list of items regarding the model (such as residuals, parameter estimates, etc)
data = theilsenreg
data
where y is the response vector and x is the regressor
We will create a Theil Sen regression model for this data
# Create the theil sen regression model
ts_model = mblm(y~x,
data = data)
# Summary of the theil-sen regression
summary(ts_model)
cannot compute exact p-value with ties
cannot compute exact p-value with ties
cannot compute exact p-value with ties
cannot compute exact p-value with ties
Call:
mblm(formula = y ~ x, dataframe = data)
Residuals:
Min 1Q Median 3Q Max
-2.8310 -1.1370 -0.1368 1.6025 23.5975
Coefficients:
Estimate MAD V value Pr(>|V|)
(Intercept) 19.6495 3.3022 209 0.000111 ***
x 2.0876 0.2928 210 9.54e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.401 on 18 degrees of freedom
We see that the slope parameter’s p-value (the one under the Pr(>|V|) on the row x) is less than \(0.05\), thus we conclude that the variable x is significant in explaining the variability in y (in other words, the regression is significant).
We will now compare this regression model to the ordinary least squares regression model (simple linear regression model).
# Ordinary least squares regression model (simple linear regression)
ols_model = lm(y~x, data = data)
# Creating a data frame to store the predicted values of each model
reg_data = data.frame(x = data$x, y = data$y,
ts_pred = ts_model$fitted.values, # prediction of theil-sen model
ols_pred = ols_model$fitted.values # prediction of the simple linear regression model
)
ggplot(reg_data,aes(x = x, y= y)) +
geom_point(size = 1) +
geom_line(aes(y = ts_pred,color = "Theil-Sen Regression")) +
geom_line(aes(y = ols_pred,color = "Simple Linear Regression"),linetype="dashed") +
scale_color_manual(values = c("Simple Linear Regression" = "darkblue","Theil-Sen Regression" = "red")) +
labs(color = "Regression Model")
Note that the Theil-Sen regression model is less affected by outliers in the data, whereas the simple linear regression model is : the (estimated) slope of the simple linear regression model is slightly larger than the Theil-Sen one, since it is being "pulled" by the 2 outliers. Thus, the nonparametric Theil-Sen regression is more robust towards outliers.
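We can also compare the fitted coefficients of the two models directly; both model objects support coef :
# Numeric comparison of the two fits
coef(ols_model) # OLS estimates, pulled upward by the outliers
coef(ts_model) # Theil-Sen estimates, more robust to the outliers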
Given a single sample of data, to estimate a parameter of the distribution by bootstrapping, we can use the function boot from the boot package
The syntax for the boot function is as follows :
boot(data,
statistic,
R,
...)
The input arguments are :
* data : the data (a vector, matrix, or dataframe) to be resampled
* statistic : a function which returns the statistic of interest when applied to the data. The function needs to have at least 2 input arguments, since 1 argument is required for the indices used in the resampling of the bootstrapping process
* R : the number of bootstrap replicates
* ... : additional arguments to be passed to the statistic function
As an example, we will bootstrap the coefficients of a linear regression model on the mtcars dataset, where we want to model a car's miles per gallon as a function of its weight and displacement.
# Read the data
boot_data = mtcars
# Define the function needed to calculate the estimated coefficients for the regression model
reg_coef = function(data, formula, i){ # The function takes 3 arguments, data, formula, and i
dat = data[i,] # The bootstrap sample that boot selects from data via the index vector i
lin_model = lm(formula, data = dat) # create the linear regression model from dat
return(lin_model$coefficients) # Return the coefficients of the regression model
}
# Bootstrapping
set.seed(100) # Set the seed for the bootstrap resampling, so that the result can be reproducible (If this is removed, the result of the bootstrap may vary with each execution)
reg_boot = boot(data = mtcars,
statistic = reg_coef,
R = 5000,
formula = mpg ~ wt + disp # the formula for the reg_coef function
)
reg_boot
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = mtcars, statistic = reg_coef, R = 5000, formula = mpg ~
wt + disp)
Bootstrap Statistics :
original bias std. error
t1* 34.96055404 0.1513454138 2.488647226
t2* -3.35082533 -0.0929682077 1.130996757
t3* -0.01772474 0.0002886382 0.008380343
We can gather from the output that the estimates for the regression coefficients are :
* t1* : the intercept
* t2* : the coefficient of wt
* t3* : the coefficient of disp
each with their own bias and standard error.
Furthermore, we can produce a confidence interval for these estimates as follows
# Confidence interval for the bootstrap estimates
boot.ci(reg_boot, conf = 0.95) # 95% conf. interval; by default the interval for the first bootstrapped statistic (the intercept) is computed
NaNs produced
extreme order statistics used as endpoints
NaNs produced
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 5000 bootstrap replicates
CALL :
boot.ci(boot.out = reg_boot, conf = 0.95)
Intervals :
Level Normal Basic Studentized
95% (29.93, 39.69 ) (30.00, 39.81 ) ( NaN, NaN )
Level Percentile BCa
95% (30.11, 39.92 ) (29.54, 39.53 )
Calculations and Intervals on Original Scale
Warning : Studentized Intervals used Extreme Quantiles
Some studentized intervals may be unstable
# To obtain intervals for the other coefficients, select them with the index argument
boot.ci(reg_boot, conf = 0.95, index = 2) # 95% conf. interval for the coefficient corresponding to wt
boot.ci(reg_boot, conf = 0.95, index = 3) # 95% conf. interval for the coefficient corresponding to disp
The confidence interval for each coefficient is calculated with 3 methods. For more information, refer to the documentation of the function boot.ci.
To do the jackknife method for estimating a distribution’s parameter by resampling, we can use the function jackknife from the bootstrap package.
library(bootstrap) # Load the library (install it first)
Attaching package: ‘bootstrap’
The following object is masked from ‘package:RVAideMemoire’:
bootstrap
The following object is masked from ‘package:sjstats’:
bootstrap
The syntax is as follows :
jackknife(x,
theta,...)
The input arguments are :
* x : a vector containing the data
* theta : the function which computes the parameter estimate from the sample x. Note that unlike for the boot function, theta need not have at least 2 input arguments.
* ... : additional arguments to be passed to the theta function
As an example, using the cars dataset, we will estimate the mean stopping distance by using the jackknife procedure.
# The data
x = cars$dist
# Define the mean function for theta
mean_jack = function(x){
return(mean(x))
}
# Jackknife
jack_res = jackknife(x, mean_jack)
jack_res
$jack.se
[1] 3.64434
$jack.bias
[1] 0
$jack.values
[1] 43.81633 43.65306 43.77551 43.40816 43.53061 43.65306
[7] 43.48980 43.32653 43.16327 43.51020 43.28571 43.57143
[13] 43.44898 43.36735 43.28571 43.32653 43.16327 43.16327
[19] 42.91837 43.32653 43.12245 42.63265 42.22449 43.44898
[25] 43.32653 42.75510 43.20408 43.04082 43.20408 43.04082
[31] 42.83673 43.00000 42.71429 42.30612 42.14286 43.12245
[37] 42.91837 42.46939 43.20408 42.87755 42.79592 42.71429
[43] 42.55102 42.51020 42.75510 42.42857 41.97959 41.95918
[49] 41.40816 42.12245
$call
jackknife(x = x, theta = mean_jack)
The function jackknife returns a list of objects : the estimate of the standard error of the estimated parameter, the bias of the estimated parameter, and the leave-one-out values of the parameter estimates (estimates of the parameter with the ith observation removed), which can be used to calculate the jackknife estimate by taking their mean.
# The jackknife parameter estimate
jack_mean_est = mean(jack_res$jack.values)
jack_mean_est
[1] 42.98
We see that the jackknife estimate for the mean is 42.98, and this estimate has an estimated standard error of 3.64434 and an estimated bias of 0.
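As a cross-check, the jackknife standard error can be recomputed by hand from the leave-one-out values returned above; a minimal sketch :
# Jackknife SE : sqrt((n-1)/n * sum((theta_(-i) - mean(theta_(-i)))^2))
n = length(x)
theta_i = jack_res$jack.values # the leave-one-out estimates
sqrt((n - 1) / n * sum((theta_i - mean(theta_i))^2)) # matches jack_res$jack.se (3.64434)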