This document contains my R code for all of the computational examples (with chapter and page numbers) from the book “The Manga Guide to Statistics,” published by No Starch Press. It was made with R Markdown, which enables easy authoring of dynamic reports with R code and the results of its execution. All results are rounded just as in the book.
I would like to thank James Church for his version of the R code (www.nostarch.com/download/MangaGuideToStatistics.R), which motivated me to write my own variant. He also kindly permitted me to use his code as a starting point. Take a look at his version if you want base R code only, without additional packages.
In my code I used several popular R packages:

- magrittr, for the pipe %>% and exposition pipe %$% operators
- dplyr, for data manipulation
- ggplot2, for plotting

Make sure that these libraries are installed (with the install.packages function or via the RStudio IDE) and loaded:
library(magrittr)
library(dplyr)
library(ggplot2)
There are no calculations in this chapter.
Page 33: Table with the prices of the top 50 ramen shops can be presented with the following vector:
ramen.prices <- c(700, 850, 600, 650, 980, 750, 500, 890, 880, 700,
890, 720, 680, 650, 790, 670, 680, 900, 880, 720,
850, 700, 780, 850, 750, 780, 590, 650, 580, 750,
800, 550, 750, 700, 600, 800, 800, 880, 790, 790,
780, 600, 670, 680, 650, 890, 930, 650, 777, 700)
Page 38: Frequency table data can be calculated with the built-in hist function (it’s not only for histogram plotting). Let’s make a table like the one on page 38:
DistributionTable <- function(hist.data){
interval.lower <- hist.data$breaks[-length(hist.data$breaks)]
interval.upper <- hist.data$breaks[-1]
hist.table <- data.frame(
Interval = paste(interval.lower, "-", interval.upper),
Midpoint = hist.data$mids,
Frequency = hist.data$counts,
Rel.Frequency = hist.data$counts / sum(hist.data$counts)
)
return(hist.table)
}
ramen.hist <- hist(ramen.prices, breaks = 5, right = FALSE, plot = FALSE)
DistributionTable(ramen.hist)
| Interval | Midpoint | Frequency | Rel.Frequency |
|---|---|---|---|
| 500 - 600 | 550 | 4 | 0.08 |
| 600 - 700 | 650 | 13 | 0.26 |
| 700 - 800 | 750 | 18 | 0.36 |
| 800 - 900 | 850 | 12 | 0.24 |
| 900 - 1000 | 950 | 3 | 0.06 |
Page 39: Histogram of ramen prices with the ggplot function from the ggplot2 package. The resulting plot was tweaked to look like the histogram in the book.
ggplot(NULL, aes(x = ramen.prices)) + theme_bw() +
geom_histogram(binwidth = 100, colour = "black", fill = "gray") +
ggtitle("Histogram of 50 best ramen shops prices") +
xlab("Ramen prices") + ylab("Frequency") +
coord_cartesian(xlim = c(490, 1010), ylim = c(0, 20)) +
scale_x_continuous(breaks = seq(550, 950, 100))
Page 41: Data frame containing scores of three bowling teams:
bowling <- data.frame(
team.A = c(86, 73, 124, 111, 90, 38),
team.B = c(84, 71, 103, 85, 90, 89),
team.C = c(229, 77, 59, 95, 70, 88)
)
bowling
| team.A | team.B | team.C |
|---|---|---|
| 86 | 84 | 229 |
| 73 | 71 | 77 |
| 124 | 103 | 59 |
| 111 | 85 | 95 |
| 90 | 90 | 70 |
| 38 | 89 | 88 |
Page 42: Average scores can be calculated with the built-in mean function, and the summarise_each function from the dplyr package can apply any function to all columns at once:
bowling %>% summarise_each(funs(mean))
| team.A | team.B | team.C |
|---|---|---|
| 87 | 87 | 103 |
Page 43: There is no built-in function for the geometric mean, but it’s simple to write a special function for it:
GeometricMean <- function(values) {
gmean <- prod(values) ^ (1 / length(values))
return(gmean)
}
bowling %>% summarise_each(funs(GeometricMean)) %>% round(3)
| team.A | team.B | team.C |
|---|---|---|
| 81.614 | 86.478 | 92.062 |
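Note that prod can overflow for long vectors of large values; a numerically safer variant (my own sketch, not from the book) uses the identity \(\sqrt[n]{x_1 \cdots x_n} = \exp\left(\frac{1}{n}\sum_{i=1}^n \ln x_i\right)\), valid for positive values:

GeometricMeanLog <- function(values) {
  # equivalent to prod(values) ^ (1 / length(values)) for positive inputs
  exp(mean(log(values)))
}
bowling %>% summarise_each(funs(GeometricMeanLog)) %>% round(3)

This reproduces the table above.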
Page 43: Harmonic mean of bowling team scores:
HarmonicMean <- function(values) {
hmean <- 1 / (sum(1 / values) / length(values))
return(hmean)
}
bowling %>% summarise_each(funs(HarmonicMean)) %>% round(3)
| team.A | team.B | team.C |
|---|---|---|
| 75.163 | 85.948 | 85.132 |
Page 45: Median scores for the bowling teams (there is a built-in median function for that):
bowling %>% summarise_each(funs(median))
| team.A | team.B | team.C |
|---|---|---|
| 88 | 87 | 82.5 |
Page 50: Standard deviation (\(\sigma\)). There is a built-in function sd that computes the sample standard deviation \(s = \sqrt{\frac{1}{N-1} \sum_{i=1}^N (x_i-\overline{x})^2}\) (see page 52), but we can write a custom function that calculates the standard deviation of the entire population, \(\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^N (x_i-\overline{x})^2}\):
StdDev <- function(values) {
m <- mean(values)
stddev <- sqrt(sum((values - m) ^ 2) / length(values))
return(stddev)
}
bowling %>% summarise_each(funs(StdDev)) %>% round(1)
| team.A | team.B | team.C |
|---|---|---|
| 27.5 | 9.5 | 57.5 |
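As a quick sanity check (my own addition, not from the book), the built-in sd and our StdDev differ only by the factor \(\sqrt{N/(N-1)}\), so both lines below print the same value:

n <- length(bowling$team.A)
sd(bowling$team.A)                          # sample version, n - 1 denominator
StdDev(bowling$team.A) * sqrt(n / (n - 1))  # population version, rescaled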
Page 55: Sturges’ Rule (using the built-in nclass.Sturges function) and the break range of a histogram:
cat("Using Sturges' Rule on ramen.prices vector:\n")
nclass.Sturges(ramen.prices) %>% paste("Number of classes:", ., "\n") %>% cat()
BreakRange <- function(values) {
return(ceiling((max(values) - min(values)) / nclass.Sturges(values)))
}
BreakRange(ramen.prices) %>% paste("Range of class:", ., "\n") %>% cat()
Using Sturges' Rule on ramen.prices vector:
Number of classes: 7
Range of class: 69
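For reference, nclass.Sturges implements \(\lceil \log_2 n + 1 \rceil\), so the class count can be checked by hand:

ceiling(log2(length(ramen.prices)) + 1)  # 7, matching nclass.Sturges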
Page 56: Frequency table of the 50 best ramen shops (Sturges’ Rule):
# vector of breakpoints
ramen.breaks <- seq(min(ramen.prices), by = BreakRange(ramen.prices),
length.out = nclass.Sturges(ramen.prices) + 1)
ramen.breaks
[1] 500 569 638 707 776 845 914 983
ramen.histS <- hist(ramen.prices, breaks = ramen.breaks, right = FALSE, plot = FALSE)
DistributionTable(ramen.histS)
| Interval | Midpoint | Frequency | Rel.Frequency |
|---|---|---|---|
| 500 - 569 | 534.5 | 2 | 0.04 |
| 569 - 638 | 603.5 | 5 | 0.10 |
| 638 - 707 | 672.5 | 15 | 0.30 |
| 707 - 776 | 741.5 | 6 | 0.12 |
| 776 - 845 | 810.5 | 10 | 0.20 |
| 845 - 914 | 879.5 | 10 | 0.20 |
| 914 - 983 | 948.5 | 2 | 0.04 |
Page 57: High school girls’ 100m track race results:
race.times <- c(16.3, 22.4, 18.5, 18.7, 20.1)
mean(race.times) %>% paste("Average:", ., "\n") %>% cat()
median(race.times) %>% paste("Median:", ., "\n") %>% cat()
StdDev(race.times) %>% round(2) %>% paste("Standard Deviation:", ., "\n") %>% cat()
Average: 19.2
Median: 18.7
Standard Deviation: 2.01
Page 61: Results of the “Do you like or dislike the new uniform design?” survey:
school.uniform.survey <- factor(
c("LIKE", "NEITHER", "LIKE", "NEITHER", "DISLIKE", "LIKE", "LIKE", "LIKE", "LIKE", "LIKE",
"LIKE", "LIKE", "NEITHER", "LIKE", "LIKE", "NEITHER", "LIKE", "LIKE", "LIKE", "LIKE",
"LIKE", "LIKE", "DISLIKE", "NEITHER", "LIKE", "LIKE", "DISLIKE", "LIKE", "LIKE", "LIKE",
"NEITHER", "NEITHER", "LIKE", "DISLIKE", "LIKE", "LIKE", "LIKE", "LIKE", "NEITHER", "LIKE"))
Page 62: A frequency table can be obtained with the table function. Sum totals can be added with the addmargins function:
school.uniform.table <- table(school.uniform.survey)
school.uniform.table %>% addmargins()
| DISLIKE | LIKE | NEITHER | Sum |
|---|---|---|---|
| 4 | 28 | 8 | 40 |
Percent Table:
school.uniform.ptable <- prop.table(school.uniform.table) * 100
school.uniform.ptable %>% addmargins()
| DISLIKE | LIKE | NEITHER | Sum |
|---|---|---|---|
| 10 | 70 | 20 | 100 |
Let’s write a function to make a table like the one on page 62:
CrossTab <- function(fact){
tab <- addmargins(table(fact))
ptab <- addmargins(prop.table(table(fact)))
ctab <- data.frame(Response = unlist(unname(dimnames(tab))),
Frequency = as.vector(tab),
Percent = as.vector(ptab) * 100)
return(ctab)
}
CrossTab(school.uniform.survey)
| Response | Frequency | Percent |
|---|---|---|
| DISLIKE | 4 | 10 |
| LIKE | 28 | 70 |
| NEITHER | 8 | 20 |
| Sum | 40 | 100 |
Graphical representation of a categorical variable:
ggplot(NULL, aes(x = school.uniform.survey)) + theme_bw() +
geom_bar(colour = "black", fill = "gray") +
ggtitle("Results of school uniform survey") +
xlab("Answer") + ylab("Count") +
coord_cartesian(ylim = c(0, 30))
Page 64: Do you expect party A to win or lose against party B?
party.victory <- factor(c("Lose", "Lose", "Lose", "I don't know", "Win", "Lose", "Win", "I don't know", "Lose", "Lose"))
CrossTab(party.victory)
| Response | Frequency | Percent |
|---|---|---|
| I don't know | 2 | 20 |
| Lose | 6 | 60 |
| Win | 2 | 20 |
| Sum | 10 | 100 |
Pages 67-69: Test scores for four school subjects:
test.scores <- data.frame(
Student = c("Rui", "Yumi", LETTERS[1:16]),
English = c(90, 81, 73, 97, 85, 60, 74, 64, 72, 67, 87, 78, 85, 96, 77, 100, 92, 86),
Japanese = c(71, 90, 79, 70, 67, 66, 60, 83, 57, 85, 93, 89, 78, 74, 65, 78, 53, 80),
History = c(73, 61, 14, 41, 49, 87, 69, 65, 36, 7, 53, 100, 57, 45, 56, 34, 37, 70),
Biology = c(59, 73, 47, 38, 63, 56, 15, 53, 80, 50, 41, 62, 44, 26, 91, 35, 53, 68)
)
test.scores
| Student | English | Japanese | History | Biology |
|---|---|---|---|---|
| Rui | 90 | 71 | 73 | 59 |
| Yumi | 81 | 90 | 61 | 73 |
| A | 73 | 79 | 14 | 47 |
| B | 97 | 70 | 41 | 38 |
| C | 85 | 67 | 49 | 63 |
| D | 60 | 66 | 87 | 56 |
| E | 74 | 60 | 69 | 15 |
| F | 64 | 83 | 65 | 53 |
| G | 72 | 57 | 36 | 80 |
| H | 67 | 85 | 7 | 50 |
| I | 87 | 93 | 53 | 41 |
| J | 78 | 89 | 100 | 62 |
| K | 85 | 78 | 57 | 44 |
| L | 96 | 74 | 45 | 26 |
| M | 77 | 65 | 56 | 91 |
| N | 100 | 78 | 34 | 35 |
| O | 92 | 53 | 37 | 53 |
| P | 86 | 80 | 70 | 68 |
Page 68: Average test scores by subject can be calculated using the summarise_each function, dropping the first column with student names via the -Student argument:
# number of significant digits to print
options(digits = 3)
test.scores %>% summarise_each(funs(mean), -Student)
| English | Japanese | History | Biology |
|---|---|---|---|
| 81.3 | 74.3 | 53 | 53 |
Page 72: Normalization of a vector can be done with the built-in scale function, but by default it uses the sample standard deviation (with an \(n-1\) denominator), so we must change the scale parameter:
Normalize <- function(values) {
norm.v <- scale(values, scale = StdDev(values))
return(norm.v)
}
options(digits = 2)
# Normalize each column except Student
test.scores %>% mutate_each(funs(Normalize), -Student)
| Student | English | Japanese | History | Biology |
|---|---|---|---|---|
| Rui | 0.77 | -0.30 | 0.88 | 0.33 |
| Yumi | -0.03 | 1.39 | 0.35 | 1.09 |
| A | -0.74 | 0.41 | -1.71 | -0.33 |
| B | 1.39 | -0.39 | -0.53 | -0.82 |
| C | 0.33 | -0.65 | -0.18 | 0.55 |
| D | -1.90 | -0.74 | 1.49 | 0.16 |
| E | -0.65 | -1.27 | 0.70 | -2.08 |
| F | -1.54 | 0.77 | 0.53 | 0.00 |
| G | -0.83 | -1.54 | -0.75 | 1.48 |
| H | -1.27 | 0.95 | -2.02 | -0.16 |
| I | 0.50 | 1.66 | 0.00 | -0.66 |
| J | -0.30 | 1.30 | 2.07 | 0.49 |
| K | 0.33 | 0.33 | 0.18 | -0.49 |
| L | 1.30 | -0.03 | -0.35 | -1.48 |
| M | -0.39 | -0.83 | 0.13 | 2.08 |
| N | 1.66 | 0.33 | -0.84 | -0.98 |
| O | 0.95 | -1.90 | -0.70 | 0.00 |
| P | 0.41 | 0.50 | 0.75 | 0.82 |
Page 74: Deviation score (the standard score rescaled to a mean of 50 and a standard deviation of 10):
DeviationScore <- function(values){
return(Normalize(values) * 10 + 50)
}
options(digits = 3)
test.scores %>% mutate_each(funs(DeviationScore), -Student)
| Student | English | Japanese | History | Biology |
|---|---|---|---|---|
| Rui | 57.7 | 47.0 | 58.8 | 53.3 |
| Yumi | 49.7 | 63.9 | 53.5 | 60.9 |
| A | 42.6 | 54.1 | 32.9 | 46.7 |
| B | 63.9 | 46.1 | 44.7 | 41.8 |
| C | 53.3 | 43.5 | 48.2 | 55.5 |
| D | 31.0 | 42.6 | 65.0 | 51.6 |
| E | 43.5 | 37.3 | 57.0 | 29.2 |
| F | 34.6 | 57.7 | 55.3 | 50.0 |
| G | 41.7 | 34.6 | 42.5 | 64.8 |
| H | 37.3 | 59.5 | 29.8 | 48.4 |
| I | 55.0 | 66.6 | 50.0 | 43.4 |
| J | 47.0 | 63.0 | 70.7 | 54.9 |
| K | 53.3 | 53.3 | 51.8 | 45.1 |
| L | 63.0 | 49.7 | 46.5 | 35.2 |
| M | 46.1 | 41.7 | 51.3 | 70.8 |
| N | 66.6 | 53.3 | 41.6 | 40.2 |
| O | 59.5 | 31.0 | 43.0 | 50.0 |
| P | 54.1 | 55.0 | 57.5 | 58.2 |
# reset to the default value
options(digits = 7)
Page 78: Mean and standard deviation of the standard scores of the girls’ 100m track race times:
mean(Normalize(race.times)) %>% round(4) %>% paste("Mean:", ., "\n") %>% cat()
Mean: 0
StdDev(Normalize(race.times)) %>% paste("Standard Deviation:", ., "\n") %>% cat()
Standard Deviation: 1
Page 93: What is the area under the standard normal curve from \(0\) to \(1.96\)? The table in the book ranges from \(0\) to \(x\), but the pnorm function in R uses the range from \(-\infty\) to \(x\), so we subtract pnorm(0) from pnorm(1.96) to get the correct result:
(pnorm(1.96) - pnorm(0)) %>% round(4)
[1] 0.475
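Since the standard normal distribution is symmetric about zero, pnorm(0) is exactly 0.5, so an equivalent expression is:

(pnorm(1.96) - 0.5) %>% round(4)

[1] 0.475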
Page 97: What is the area under the standard normal curve from \(1.8\) to \(\infty\)? To integrate from \(x\) to \(\infty\), we can use the lower.tail = FALSE parameter.
pnorm(1.8, lower.tail = FALSE) %>% round(4)
[1] 0.0359
Page 104: What is the \(\chi^2\) value when the probability is \(0.05\) and there is \(1\) degree of freedom? By default, the qchisq function returns the \(q\) for which \(P(X < q) = p\), while the table in the book assumes \(P(X \geq q) = p\). To adjust, add the lower.tail = FALSE parameter:
qchisq(0.05, df = 1, lower.tail = FALSE) %>% round(4)
[1] 3.8415
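We can verify the result with pchisq, the inverse of qchisq: the upper-tail probability at this quantile should be 0.05:

pchisq(3.8415, df = 1, lower.tail = FALSE) %>% round(4)

[1] 0.05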
Page 108: What is the area under the standard normal curve from \(-\infty\) to \(-0.29\)?
pnorm(-0.29) %>% round(4)
[1] 0.3859
Page 108: What is the \(\chi^2\) value when the probability is \(0.05\) and \(2\) degrees of freedom?
qchisq(0.05, df = 2, lower.tail = FALSE) %>% round(4)
[1] 5.9915
Page 116: Monthly expenditures on makeup and clothes:
street.survey <- data.frame(
Respondent = c(paste("Ms.", LETTERS[1:10])),
Makeup = c(3000, 5000, 12000, 2000, 7000, 15000, 5000, 6000, 8000, 10000),
Clothes = c(7000, 8000, 25000, 5000, 12000, 30000, 10000, 15000, 20000, 18000)
)
street.survey
| Respondent | Makeup | Clothes |
|---|---|---|
| Ms. A | 3000 | 7000 |
| Ms. B | 5000 | 8000 |
| Ms. C | 12000 | 25000 |
| Ms. D | 2000 | 5000 |
| Ms. E | 7000 | 12000 |
| Ms. F | 15000 | 30000 |
| Ms. G | 5000 | 10000 |
| Ms. H | 6000 | 15000 |
| Ms. I | 8000 | 20000 |
| Ms. J | 10000 | 18000 |
Scatterplot of the survey results using the ggplot2 package:
ggplot(street.survey, aes(Makeup, Clothes)) + geom_point(size=3) + theme_bw() +
ggtitle("Scatter chart of monthly expenditures on makeup and clothes") +
xlab("Amount spent on makeup (¥)") + ylab("Amount spent on clothes (¥)") +
coord_cartesian(xlim = c(0, 30000), ylim = c(0, 32000))
Page 118: Correlation coefficient between expenditures on makeup and expenditures on clothes:
street.survey %$% cor(Makeup, Clothes) %>% round(4)
[1] 0.968
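The same value follows from the definition \(r = \frac{\mathrm{cov}(x, y)}{\sigma_x \sigma_y}\); here is a check (my own addition) using the built-in cov and sd functions, whose \(n-1\) denominators cancel:

with(street.survey, cov(Makeup, Clothes) / (sd(Makeup) * sd(Clothes))) %>% round(4)

[1] 0.968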
Page 121: Age and favorite fashion brand:
brand.survey <- data.frame(
Respondent = c(paste("Ms.", LETTERS[1:15])),
Age = c(27, 33, 16, 29, 32, 23, 25, 28, 22, 18, 26, 26, 15, 29, 26),
Brand = as.factor(c("Theremes", "Channelior", "Bureperry", "Bureperry", "Channelior",
"Theremes", "Channelior", "Theremes", "Bureperry", "Bureperry",
"Channelior", "Theremes", "Bureperry", "Channelior", "Bureperry"))
)
brand.survey
| Respondent | Age | Brand |
|---|---|---|
| Ms. A | 27 | Theremes |
| Ms. B | 33 | Channelior |
| Ms. C | 16 | Bureperry |
| Ms. D | 29 | Bureperry |
| Ms. E | 32 | Channelior |
| Ms. F | 23 | Theremes |
| Ms. G | 25 | Channelior |
| Ms. H | 28 | Theremes |
| Ms. I | 22 | Bureperry |
| Ms. J | 18 | Bureperry |
| Ms. K | 26 | Channelior |
| Ms. L | 26 | Theremes |
| Ms. M | 15 | Bureperry |
| Ms. N | 29 | Channelior |
| Ms. O | 26 | Bureperry |
Page 122: Scatterplot of the girls’ “favorite fashion brand and age” survey:
ggplot(brand.survey, aes(Brand, Age)) + geom_point(size=3) +
theme_bw() + ggtitle("Scatter chart of favorite fashion brand and age")
Pages 123-124: What is the correlation ratio between a girl’s age and her favorite fashion brand? There is no built-in function for the correlation ratio, so we must do all the calculations ourselves. Here I use the popular dplyr package, which is very handy for various data manipulations:
# Calculating intraclass variance
intraclass.var <- brand.survey %>% group_by(Brand) %>%
summarise(S = sum((Age - mean(Age))^2)) %>%
summarise(sum(S)) %>% as.numeric()
# Average for all data
age.mean <- brand.survey %>% summarise(mean(Age)) %>% as.numeric()
# Calculating interclass variance
interclass.var <- brand.survey %>% group_by(Brand) %>%
summarise(S = n()*(mean(Age) - age.mean)^2) %>%
summarise(sum(S)) %>% as.numeric()
cor.ratio <- interclass.var / (intraclass.var + interclass.var)
intraclass.var %>% round(4) %>% paste("Intraclass variance:", ., "\n") %>% cat()
interclass.var %>% round(4) %>% paste("Interclass variance:", ., "\n") %>% cat()
cor.ratio %>% round(4) %>% paste("Correlation Ratio:", ., "\n") %>% cat()
Intraclass variance: 224
Interclass variance: 180
Correlation Ratio: 0.4455
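As a consistency check (my own addition), the intraclass and interclass variances should add up to the total sum of squared deviations of Age, i.e. \(224 + 180 = 404\):

brand.survey %$% sum((Age - mean(Age))^2)

[1] 404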
Page 128: Cross tabulation of sex and desired way of being asked out:
asked.out.table <- matrix(c(34, 61, 53, 38, 40, 74), byrow = TRUE, nrow = 2)
dimnames(asked.out.table) = list(Sex=c("Female", "Male"),
DesiredWay = c("Phone", "E-mail", "Face to face"))
asked.out.table %>% addmargins()
| | Phone | E-mail | Face to face | Sum |
|---|---|---|---|---|
| Female | 34 | 61 | 53 | 148 |
| Male | 38 | 40 | 74 | 152 |
| Sum | 72 | 101 | 127 | 300 |
Page 132: Calculate Pearson’s \(\chi^2_0\) test statistic using the built-in chisq.test function:
asked.out.chisq <- chisq.test(asked.out.table)$statistic
asked.out.chisq %>% round(4)
X-squared
8.0091
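Under the hood, chisq.test compares the observed counts with the expected counts under independence, \(E_{ij} = (\text{row}_i \text{ total}) \times (\text{column}_j \text{ total}) / n\); here is a manual sketch of the same statistic (my own addition):

# expected frequencies under the independence hypothesis
expected <- outer(rowSums(asked.out.table), colSums(asked.out.table)) / sum(asked.out.table)
# Pearson's chi-squared statistic: sum of (O - E)^2 / E
sum((asked.out.table - expected)^2 / expected) %>% round(4)

[1] 8.0091

The values match exactly because chisq.test applies Yates’ continuity correction only to 2x2 tables, and this one is 2x3.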
Page 132: Calculate Cramer’s coefficient, defined as \(\sqrt{\chi^2_0 / (n(\min(r, c) - 1))}\), where \(n\) is the grand total and \(r\) and \(c\) are the numbers of rows and columns:
Cramers <- function(ctable) {
chi2 <- as.numeric(chisq.test(ctable)$statistic)
s <- sum(ctable)
d <- min(dim(ctable))
return(sqrt(chi2 / (s * (d - 1))))
}
Cramers(asked.out.table) %>% round(4)
[1] 0.1634
Page 138: What is Cramer’s coefficient for the cross tabulation of ordered food type and tea or coffee preference?
drink.table <- matrix(c(43, 33, 51, 53, 29, 41), byrow = TRUE, nrow = 3)
dimnames(drink.table) = list(FoodType=c("Japanese", "European", "Chinese"),
Preference = c("Coffee", "Tea"))
drink.table %>% addmargins()
| | Coffee | Tea | Sum |
|---|---|---|---|
| Japanese | 43 | 33 | 76 |
| European | 51 | 53 | 104 |
| Chinese | 29 | 41 | 70 |
| Sum | 123 | 127 | 250 |
Page 141: Calculate Pearson’s \(\chi^2_0\) test statistic:
chisq.test(drink.table)$statistic %>% round(4)
X-squared
3.3483
Page 141: Calculate Cramer’s coefficient:
Cramers(drink.table) %>% round(4)
[1] 0.1157
Let’s carry out the experiment with Pearson’s chi-square statistic (\(\chi^2_0\)) and the chi-square distribution described on pages 152-153:
Page 153: Contingency table of sex and desired way of being asked out for “all high school students residing in Japan”:
asked.out.table.all <- as.table(matrix(c(400, 1600, 2000, 600, 2400, 3000),
                                        byrow = TRUE, nrow = 2))
dimnames(asked.out.table.all) = list(Sex=c("Female", "Male"),
DesiredWay = c("Phone", "E-mail", "Face to face"))
asked.out.table.all %>% addmargins()
| | Phone | E-mail | Face to face | Sum |
|---|---|---|---|---|
| Female | 400 | 1600 | 2000 | 4000 |
| Male | 600 | 2400 | 3000 | 6000 |
| Sum | 1000 | 4000 | 5000 | 10000 |
We’ll make a data frame with the frequencies of all combinations of the levels of the two factors:
asked.out.freq <- as.data.frame(asked.out.table.all)
asked.out.freq
| Sex | DesiredWay | Freq |
|---|---|---|
| Female | Phone | 400 |
| Male | Phone | 600 |
| Female | E-mail | 1600 |
| Male | E-mail | 2400 |
| Female | Face to face | 2000 |
| Male | Face to face | 3000 |
To make a raw data table (like table 7-1 on page 153) from this frequency table, we repeat each row Freq times:
FreqToCases <- function(row) {
n <- row$Freq
df <- data.frame(
Sex = rep(row$Sex, n),
DesiredWay = rep(row$DesiredWay, n)
)
return(df)
}
# Call FreqToCases function for every row of asked.out.freq and
# glue all results in a single data frame
asked.out.data <- asked.out.freq %>% rowwise() %>% do(FreqToCases(.))
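An equivalent base R alternative (a sketch, no dplyr needed) replicates the row indices with rep:

# repeat row i of the two factor columns Freq[i] times
asked.out.data2 <- asked.out.freq[rep(seq_len(nrow(asked.out.freq)), asked.out.freq$Freq),
                                  c("Sex", "DesiredWay")]
nrow(asked.out.data2)  # 10000 rows, one per student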
Then we will draw 10000 different random samples of 300 high school students each and calculate the \(\chi^2_0\) statistic for each sample.
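Because sample_n draws rows at random, the statistics below will vary from run to run; fixing the random seed first makes the experiment reproducible (the seed value 1 is arbitrary, my own addition):

set.seed(1)  # any fixed seed makes the random sampling reproducible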
chisq.data <- data.frame(Sample = 1:10000)
SampleChiSq <- function(n) {
# random sample of n students
asked.out.sample <- sample_n(asked.out.data, n)
# contingency table for this sample
asked.out.sample.table <- table(asked.out.sample)
# return Chi-squared statistic for this sample
return(chisq.test(asked.out.sample.table)$statistic)
}
# Vectorize a function to be able to make all random samples at once
SampleChiSq %<>% Vectorize()
# Calculating 10000 different Chi-squared statistics
chisq.data$ChiSq <- SampleChiSq(rep(300,10000))
# View first 10 rows of chisq.data
chisq.data %>% head(10) %>% round(5)
| Sample | ChiSq |
|---|---|
| 1 | 0.70046 |
| 2 | 2.46699 |
| 3 | 4.70145 |
| 4 | 0.83618 |
| 5 | 2.85338 |
| 6 | 0.77407 |
| 7 | 2.19792 |
| 8 | 0.04066 |
| 9 | 0.52391 |
| 10 | 0.90191 |
Page 154: Histogram of 10000 \(\chi^2_0\) statistics from 10000 random samples:
# plot title with mathematical annotation
title <- expression(paste("Histogram of ", chi[0]^2, " statistics from 10000 samples"))
# plot using ggplot2 package
chi.plot1 <- ggplot(chisq.data, aes(x = ChiSq)) + theme_bw() +
geom_histogram(aes(y = ..density..), binwidth = 0.5, colour = "black", fill = "gray") +
xlab(expression(paste(chi[0]^2, " value"))) + ylab("Density of distribution") +
ggtitle(title) + coord_cartesian(xlim = c(0, 15), ylim = c(0, 0.52))
chi.plot1
Page 158: Critical value \(\chi^2_{crit}\) for \(0.05\) significance level:
chi.crit <- qchisq(0.05, df = 2, lower.tail = FALSE)
chi.crit
[1] 5.991465
Page 159: We can add a graph of \(\chi^2\)-distribution to the histogram and show the critical region:
# title with mathematical annotation
title <- expression(atop(paste("Histogram of ", chi[0]^2, " statistics from 10000 samples and "),
paste("a graph of ", chi^2, " distribution with 2 degrees of freedom")))
# add some elements to chi.plot1
chi.plot2 <- chi.plot1 + ggtitle(title) +
stat_function(fun = dchisq, args = list(df=2), colour = "blue") +
geom_vline(aes(xintercept = chi.crit), colour = "red", linetype = "dashed") +
annotate("text", label = "{chi[crit]^2}(alpha == 0.05)", parse = TRUE, x = chi.crit + 1.9,
y = 0.07, color = "red", size = 6)
chi.plot2
The histogram looks very similar to the \(\chi^2\) distribution, because Cramer’s coefficient for the population “all high school students residing in Japan” is equal to zero:
Cramers(asked.out.table.all) %>% cat()
0
Let’s see how many samples have a \(\chi^2_0\) statistic equal to or greater than \(\chi^2_{crit}\):
chisq.data %>% filter(ChiSq >= chi.crit) %>% nrow()
[1] 450
And that’s 4.5% (nearly 5%) of all 10000 samples, as expected.
Page 177: P-value:
chisq.test(asked.out.table)$p.value %>% round(4)
[1] 0.0182
Or you can get the full summary with:
chisq.test(asked.out.table)
Pearson's Chi-squared test
data: asked.out.table
X-squared = 8.0091, df = 2, p-value = 0.01823
Page 188: \(\chi^2\) test of independence of the table of ordered food and tea and coffee preferences:
chisq.test(drink.table)
Pearson's Chi-squared test
data: drink.table
X-squared = 3.3483, df = 2, p-value = 0.1875
qchisq(0.01, df = 2, lower.tail = FALSE) %>% round(4) %>%
paste("Chi-squared critical value:", ., "\n") %>% cat()
Chi-squared critical value: 9.2103
Since \(\chi^2_0 = 3.3483\) is below the critical value (and the p-value of \(0.1875\) is above the \(0.01\) significance level), we cannot reject the null hypothesis that food type and drink preference are independent.