library(tibble)
library(tidyverse)
library(vtable)
library(ggplot2)
library(gplots)
library(graphics)
library(vcd)
library(corrplot)
library(gmodels)
FREQUENCY TEST
Loading Packages:
One-Proportion Z-Test in R:
One-proportion Z-test is utilized to contrast a seen proportion with a hypothetical one. This is when there only exist two categories. This article delineates fundamentals of the one-proportion Z-test. It also furnishes practical instances employing R software.
For Example:
For instance we possess a mouse population. It's got half males and half females. This yields a probability of 0.5 or 50%. A number of these mice numbering 160 have developed spontaneous cancer. We have 95 male mice and 65 female mice in this category.
The number of successes (male with cancer) is 95
The observed proportion (popo) of male is 95/160
The observed proportion (qq) of female is 1−po1−po
The expected proportion (pepe) of male is 0.5 (50%)
The number of observations (nn) is 160
R functions: binom.test() & prop.test()
The R functions binom.test() and prop.test() can be used to perform one-proportion test:
binom.test(): compute exact binomial test. Recommended when sample size is small
prop.test(): can be used when sample size is large ( N > 30). It uses a normal approximation to binomial
prop.test(x=95,n=160,p=0.5, correct=FALSE)
1-sample proportions test without continuity correction
data: 95 out of 160, null probability 0.5
X-squared = 5.625, df = 1, p-value = 0.01771
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.5163169 0.6667870
sample estimates:
p
0.59375
prop.test(x = 95, n = 160, p = 0.5, correct = FALSE,alternative = "less")
1-sample proportions test without continuity correction
data: 95 out of 160, null probability 0.5
X-squared = 5.625, df = 1, p-value = 0.9911
alternative hypothesis: true p is less than 0.5
95 percent confidence interval:
0.0000000 0.6555425
sample estimates:
p
0.59375
<- prop.test(x = 95, n = 160, p = 0.5,
res correct = FALSE)
# Printing the results
res
1-sample proportions test without continuity correction
data: 95 out of 160, null probability 0.5
X-squared = 5.625, df = 1, p-value = 0.01771
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.5163169 0.6667870
sample estimates:
p
0.59375
# printing the p-value
$p.value res
[1] 0.01770607
# printing the mean
$estimate res
p
0.59375
# printing the confidence interval
$conf.int res
[1] 0.5163169 0.6667870
attr(,"conf.level")
[1] 0.95
What is two-proportions z-test?
The two-proportions z-test is used to compare two observed proportions. This article describes the basics of two-proportions *z-test and provides pratical examples using R software
Example:
we have two groups of individuals:
Group A with lung cancer: n = 500 Group B, healthy individuals: n = 500 The number of smokers in each group is as follow:
Group A with lung cancer: n = 500, 490 smokers, pA=490/500=98 Group B, healthy individuals: n = 500, 400 smokers, pB=400/500=80 In this setting:
The overall proportion of smokers is p=frac(490+400)500+500=89 The overall proportion of non-smokers is q=1−p=11
<- prop.test(x = c(490, 400), n = c(500, 500))
res # Printing the results
res
2-sample test for equality of proportions with continuity correction
data: c(490, 400) out of c(500, 500)
X-squared = 80.909, df = 1, p-value < 2.2e-16
alternative hypothesis: two.sided
95 percent confidence interval:
0.1408536 0.2191464
sample estimates:
prop 1 prop 2
0.98 0.80
# printing the p-value
$p.value res
[1] 2.363439e-19
# printing the mean
$estimate res
prop 1 prop 2
0.98 0.80
# printing the confidence interval
$conf.int res
[1] 0.1408536 0.2191464
attr(,"conf.level")
[1] 0.95
chi-square goodness of fit test:
The chi-square goodness of fit test is used to compare the observed distribution to an expected distribution, in a situation where we have two or more categories in a discrete data. In other words, it compares multiple observed proportions to expected probabilities.
<- c(81, 50, 27)
tulip <- chisq.test(tulip, p = c(1/3, 1/3, 1/3))
res res
Chi-squared test for given probabilities
data: tulip
X-squared = 27.886, df = 2, p-value = 8.803e-07
# printing the p-value
$p.value res
[1] 8.802693e-07
# printing the mean
$estimate res
NULL
Chi-Square Test of Independence in R:
The chi-square test of independence is used to analyze the frequency table (i.e. contingency table) formed by two categorical variables. The chi-square test evaluates whether there is a significant association between the categories of the two variables. This article describes the basics of chi-square test and provides practical examples using R software.
Data format: Contingency tables:
# Import the data
<- "http://www.sthda.com/sthda/RDoc/data/housetasks.txt"
file_path <- read.delim(file_path, row.names = 1)
housetasks # head(housetasks)
Graphical display of contengency tables:
Contingency table can be visualized using the function balloonplot(). This function draws a graphical matrix where each cell contains a dot whose size reflects the relative magnitude of the corresponding component.
# 1. convert the data as a table
<- as.table(as.matrix(housetasks))
dt # 2. Graph
balloonplot(t(dt), main ="housetasks", xlab ="", ylab="",
label = FALSE, show.margins = FALSE)
Mosaicplot:
library("graphics")
mosaicplot(dt, shade = TRUE, las=2,
main = "housetasks")
- Blue color indicates that the observed value is higher than the expected value.
- Red color specifies that the observed value is lower than the expected value.
Plotting a Subset of the table:
# install.packages("vcd")
library("vcd")
# plot just a subset of the table
assoc(head(dt, 5), shade = TRUE, las=3)
Compute chi-square test in R:
<- chisq.test(housetasks)
chisq chisq
Pearson's Chi-squared test
data: housetasks
X-squared = 1944.5, df = 36, p-value < 2.2e-16
# Observed counts
$observed chisq
Wife Alternating Husband Jointly
Laundry 156 14 2 4
Main_meal 124 20 5 4
Dinner 77 11 7 13
Breakfeast 82 36 15 7
Tidying 53 11 1 57
Dishes 32 24 4 53
Shopping 33 23 9 55
Official 12 46 23 15
Driving 10 51 75 3
Finances 13 13 21 66
Insurance 8 1 53 77
Repairs 0 3 160 2
Holidays 0 1 6 153
# Expected counts
round(chisq$expected,2)
Wife Alternating Husband Jointly
Laundry 60.55 25.63 38.45 51.37
Main_meal 52.64 22.28 33.42 44.65
Dinner 37.16 15.73 23.59 31.52
Breakfeast 48.17 20.39 30.58 40.86
Tidying 41.97 17.77 26.65 35.61
Dishes 38.88 16.46 24.69 32.98
Shopping 41.28 17.48 26.22 35.02
Official 33.03 13.98 20.97 28.02
Driving 47.82 20.24 30.37 40.57
Finances 38.88 16.46 24.69 32.98
Insurance 47.82 20.24 30.37 40.57
Repairs 56.77 24.03 36.05 48.16
Holidays 55.05 23.30 34.95 46.70
Function chisq.test():
round(chisq$residuals, 3)
Wife Alternating Husband Jointly
Laundry 12.266 -2.298 -5.878 -6.609
Main_meal 9.836 -0.484 -4.917 -6.084
Dinner 6.537 -1.192 -3.416 -3.299
Breakfeast 4.875 3.457 -2.818 -5.297
Tidying 1.702 -1.606 -4.969 3.585
Dishes -1.103 1.859 -4.163 3.486
Shopping -1.289 1.321 -3.362 3.376
Official -3.659 8.563 0.443 -2.459
Driving -5.469 6.836 8.100 -5.898
Finances -4.150 -0.852 -0.742 5.750
Insurance -5.758 -4.277 4.107 5.720
Repairs -7.534 -4.290 20.646 -6.651
Holidays -7.419 -4.620 -4.897 15.556
Visualize Pearson residuals using the package corrplot:
corrplot(chisq$residuals, is.cor = FALSE)
Contibution in percentage (%):
<- 100*chisq$residuals^2/chisq$statistic
contrib round(contrib, 3)
Wife Alternating Husband Jointly
Laundry 7.738 0.272 1.777 2.246
Main_meal 4.976 0.012 1.243 1.903
Dinner 2.197 0.073 0.600 0.560
Breakfeast 1.222 0.615 0.408 1.443
Tidying 0.149 0.133 1.270 0.661
Dishes 0.063 0.178 0.891 0.625
Shopping 0.085 0.090 0.581 0.586
Official 0.688 3.771 0.010 0.311
Driving 1.538 2.403 3.374 1.789
Finances 0.886 0.037 0.028 1.700
Insurance 1.705 0.941 0.868 1.683
Repairs 2.919 0.947 21.921 2.275
Holidays 2.831 1.098 1.233 12.445
Visualize the contribution:
corrplot(contrib, is.cor = FALSE)
Excerise 4:
Loading Data:
mpg
# A tibble: 234 × 11
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
3 audi a4 2 2008 4 manu… f 20 31 p comp…
4 audi a4 2 2008 4 auto… f 21 30 p comp…
5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
# ℹ 224 more rows
Using group_by() and summarize():
we can get the total number of cars with each class & cyl combination using group_by() and summarize().
%>%
mpggroup_by(class, cyl)%>%
summarize(n=n())%>%
kable()
`summarise()` has grouped output by 'class'. You can override using the
`.groups` argument.
class | cyl | n |
---|---|---|
2seater | 8 | 5 |
compact | 4 | 32 |
compact | 5 | 2 |
compact | 6 | 13 |
midsize | 4 | 16 |
midsize | 6 | 23 |
midsize | 8 | 2 |
minivan | 4 | 1 |
minivan | 6 | 10 |
pickup | 4 | 3 |
pickup | 6 | 10 |
pickup | 8 | 20 |
subcompact | 4 | 21 |
subcompact | 5 | 2 |
subcompact | 6 | 7 |
subcompact | 8 | 5 |
suv | 4 | 8 |
suv | 6 | 16 |
suv | 8 | 38 |
Crosstabs (dplyr & tidyr):
%>%
mpggroup_by(class, cyl)%>%
summarise(n=n())%>%
spread(cyl, n)%>%
kable()
`summarise()` has grouped output by 'class'. You can override using the
`.groups` argument.
class | 4 | 5 | 6 | 8 |
---|---|---|---|---|
2seater | NA | NA | NA | 5 |
compact | 32 | 2 | 13 | NA |
midsize | 16 | NA | 23 | 2 |
minivan | 1 | NA | 10 | NA |
pickup | 3 | NA | 10 | 20 |
subcompact | 21 | 2 | 7 | 5 |
suv | 8 | NA | 16 | 38 |
Summary statistics:
Here instead of displaying frequencies, we can get the average number of city miles by class & cyl.
%>%
mpggroup_by(class, cyl)%>%
summarise(mean_cty=mean(cty))%>%
spread(cyl, mean_cty)%>%
kable()
`summarise()` has grouped output by 'class'. You can override using the
`.groups` argument.
class | 4 | 5 | 6 | 8 |
---|---|---|---|---|
2seater | NA | NA | NA | 15.40000 |
compact | 21.37500 | 21 | 16.92308 | NA |
midsize | 20.50000 | NA | 17.78261 | 16.00000 |
minivan | 18.00000 | NA | 15.60000 | NA |
pickup | 16.00000 | NA | 14.50000 | 11.80000 |
subcompact | 22.85714 | 20 | 17.00000 | 14.80000 |
suv | 18.00000 | NA | 14.50000 | 12.13158 |
Max number of city miles by class & cyl:
%>%
mpggroup_by(class, cyl)%>%
summarise(max_cty=max(cty))%>%
spread(cyl, max_cty)%>%
kable()
`summarise()` has grouped output by 'class'. You can override using the
`.groups` argument.
class | 4 | 5 | 6 | 8 |
---|---|---|---|---|
2seater | NA | NA | NA | 16 |
compact | 33 | 21 | 18 | NA |
midsize | 23 | NA | 19 | 16 |
minivan | 18 | NA | 17 | NA |
pickup | 17 | NA | 16 | 14 |
subcompact | 35 | 20 | 18 | 15 |
suv | 20 | NA | 17 | 14 |
Proportions (dplyr & tidyr):
We can find proportions by creating a new, calculated variable dividing row frequency by table frequency.
%>%
mpggroup_by(class)%>%
summarize(n=n())%>%
mutate(prop=n/sum(n))%>% # our new proportion variable
kable()
class | n | prop |
---|---|---|
2seater | 5 | 0.0213675 |
compact | 47 | 0.2008547 |
midsize | 41 | 0.1752137 |
minivan | 11 | 0.0470085 |
pickup | 33 | 0.1410256 |
subcompact | 35 | 0.1495726 |
suv | 62 | 0.2649573 |
Contingency table of proportion values:
We can create a contingency table of proportion values by applying the same spread command as before. Vary the group_by() and spread() arguents to produce proportions of different variables.
%>%
mpggroup_by(class, cyl)%>%
summarize(n=n())%>%
mutate(prop=n/sum(n))%>%
subset(select=c("class","cyl","prop"))%>% #drop the frequency value
spread(class, prop)%>%
kable()
`summarise()` has grouped output by 'class'. You can override using the
`.groups` argument.
cyl | 2seater | compact | midsize | minivan | pickup | subcompact | suv |
---|---|---|---|---|---|---|---|
4 | NA | 0.6808511 | 0.3902439 | 0.0909091 | 0.0909091 | 0.6000000 | 0.1290323 |
5 | NA | 0.0425532 | NA | NA | NA | 0.0571429 | NA |
6 | NA | 0.2765957 | 0.5609756 | 0.9090909 | 0.3030303 | 0.2000000 | 0.2580645 |
8 | 1 | NA | 0.0487805 | NA | 0.6060606 | 0.1428571 | 0.6129032 |
Table():
table() is a quick way to pull together row/column frequencies and proportions for categorical variables.Using the basic table() command, we can get a contingency table of vehicle class by number of cylinders.
table(mpg$class, mpg$cyl)
4 5 6 8
2seater 0 0 0 5
compact 32 2 13 0
midsize 16 0 23 2
minivan 1 0 10 0
pickup 3 0 10 20
subcompact 21 2 7 5
suv 8 0 16 38
Table, Column, and Row Frequencies:
<- table(mpg$class, mpg$cyl) #define object w/table parameters for simple calling
mpg_tableftable(mpg_table)
4 5 6 8
2seater 0 0 0 5
compact 32 2 13 0
midsize 16 0 23 2
minivan 1 0 10 0
pickup 3 0 10 20
subcompact 21 2 7 5
suv 8 0 16 38
For row frequencies, we use the margin.table() command:
margin.table(mpg_table, 1)
2seater compact midsize minivan pickup subcompact suv
5 47 41 11 33 35 62
For Column:
margin.table(mpg_table, 2)
4 5 6 8
81 4 79 70
Table, Column, and Row Proportions:
For proportion of the entire table.
prop.table(mpg_table) #proportion of entire table
4 5 6 8
2seater 0.000000000 0.000000000 0.000000000 0.021367521
compact 0.136752137 0.008547009 0.055555556 0.000000000
midsize 0.068376068 0.000000000 0.098290598 0.008547009
minivan 0.004273504 0.000000000 0.042735043 0.000000000
pickup 0.012820513 0.000000000 0.042735043 0.085470085
subcompact 0.089743590 0.008547009 0.029914530 0.021367521
suv 0.034188034 0.000000000 0.068376068 0.162393162
For Row:
prop.table(mpg_table, 1) #proportion of entire row
4 5 6 8
2seater 0.00000000 0.00000000 0.00000000 1.00000000
compact 0.68085106 0.04255319 0.27659574 0.00000000
midsize 0.39024390 0.00000000 0.56097561 0.04878049
minivan 0.09090909 0.00000000 0.90909091 0.00000000
pickup 0.09090909 0.00000000 0.30303030 0.60606061
subcompact 0.60000000 0.05714286 0.20000000 0.14285714
suv 0.12903226 0.00000000 0.25806452 0.61290323
For column:
prop.table(mpg_table, 2) #proportion of entire column
4 5 6 8
2seater 0.00000000 0.00000000 0.00000000 0.07142857
compact 0.39506173 0.50000000 0.16455696 0.00000000
midsize 0.19753086 0.00000000 0.29113924 0.02857143
minivan 0.01234568 0.00000000 0.12658228 0.00000000
pickup 0.03703704 0.00000000 0.12658228 0.28571429
subcompact 0.25925926 0.50000000 0.08860759 0.07142857
suv 0.09876543 0.00000000 0.20253165 0.54285714
gmodels::CrossTable():
The CrossTable() command from the gmodels package produces frequencies, and table, row, & column proportions with a single command. The values are not as quickly drawn into tables of their own, or further manipulated as they are with the dyplr/tidyr tables, but this is a handy command nonetheless
CrossTable(mpg$class, mpg$cyl)
Cell Contents
|-------------------------|
| N |
| Chi-square contribution |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 234
| mpg$cyl
mpg$class | 4 | 5 | 6 | 8 | Row Total |
-------------|-----------|-----------|-----------|-----------|-----------|
2seater | 0 | 0 | 0 | 5 | 5 |
| 1.731 | 0.085 | 1.688 | 8.210 | |
| 0.000 | 0.000 | 0.000 | 1.000 | 0.021 |
| 0.000 | 0.000 | 0.000 | 0.071 | |
| 0.000 | 0.000 | 0.000 | 0.021 | |
-------------|-----------|-----------|-----------|-----------|-----------|
compact | 32 | 2 | 13 | 0 | 47 |
| 15.210 | 1.782 | 0.518 | 14.060 | |
| 0.681 | 0.043 | 0.277 | 0.000 | 0.201 |
| 0.395 | 0.500 | 0.165 | 0.000 | |
| 0.137 | 0.009 | 0.056 | 0.000 | |
-------------|-----------|-----------|-----------|-----------|-----------|
midsize | 16 | 0 | 23 | 2 | 41 |
| 0.230 | 0.701 | 6.059 | 8.591 | |
| 0.390 | 0.000 | 0.561 | 0.049 | 0.175 |
| 0.198 | 0.000 | 0.291 | 0.029 | |
| 0.068 | 0.000 | 0.098 | 0.009 | |
-------------|-----------|-----------|-----------|-----------|-----------|
minivan | 1 | 0 | 10 | 0 | 11 |
| 2.070 | 0.188 | 10.641 | 3.291 | |
| 0.091 | 0.000 | 0.909 | 0.000 | 0.047 |
| 0.012 | 0.000 | 0.127 | 0.000 | |
| 0.004 | 0.000 | 0.043 | 0.000 | |
-------------|-----------|-----------|-----------|-----------|-----------|
pickup | 3 | 0 | 10 | 20 | 33 |
| 6.211 | 0.564 | 0.117 | 10.391 | |
| 0.091 | 0.000 | 0.303 | 0.606 | 0.141 |
| 0.037 | 0.000 | 0.127 | 0.286 | |
| 0.013 | 0.000 | 0.043 | 0.085 | |
-------------|-----------|-----------|-----------|-----------|-----------|
subcompact | 21 | 2 | 7 | 5 | 35 |
| 6.515 | 3.284 | 1.963 | 2.858 | |
| 0.600 | 0.057 | 0.200 | 0.143 | 0.150 |
| 0.259 | 0.500 | 0.089 | 0.071 | |
| 0.090 | 0.009 | 0.030 | 0.021 | |
-------------|-----------|-----------|-----------|-----------|-----------|
suv | 8 | 0 | 16 | 38 | 62 |
| 8.444 | 1.060 | 1.162 | 20.403 | |
| 0.129 | 0.000 | 0.258 | 0.613 | 0.265 |
| 0.099 | 0.000 | 0.203 | 0.543 | |
| 0.034 | 0.000 | 0.068 | 0.162 | |
-------------|-----------|-----------|-----------|-----------|-----------|
Column Total | 81 | 4 | 79 | 70 | 234 |
| 0.346 | 0.017 | 0.338 | 0.299 | |
-------------|-----------|-----------|-----------|-----------|-----------|