This document contains my R code for all of the computational examples (with chapter and page numbers) from the book “The Manga Guide to Statistics,” published by No Starch Press. It was made with R Markdown, which enables easy authoring of dynamic reports with R code and the results of its execution. All results are rounded just as in the book.
I would like to thank James Church for his version of the R code (www.nostarch.com/download/MangaGuideToStatistics.R), which motivated me to write my own variant. He also kindly permitted me to use his code as a starting point. Take a look at his version if you want base R code only, without additional packages.
In my code I used several popular R packages:

- magrittr, for the pipe %>% and exposition pipe %$% operators
- dplyr, for data manipulation
- ggplot2, for plotting

Make sure that these libraries are installed (with the install.packages function or via the RStudio IDE) and loaded:
library(magrittr)
library(dplyr)
library(ggplot2)
There are no calculations in this chapter.
Page 33: Table with the prices of the top 50 ramen shops can be presented with the following vector:
ramen.prices <- c(700, 850, 600, 650, 980, 750, 500, 890, 880, 700,
890, 720, 680, 650, 790, 670, 680, 900, 880, 720,
850, 700, 780, 850, 750, 780, 590, 650, 580, 750,
800, 550, 750, 700, 600, 800, 800, 880, 790, 790,
780, 600, 670, 680, 650, 890, 930, 650, 777, 700)
Page 38: Frequency table data can be calculated with the built-in hist function (it’s not only for histogram plotting). Let’s make a table like the one on page 38:
DistributionTable <- function(hist.data){
interval.lower <- hist.data$breaks[-length(hist.data$breaks)]
interval.upper <- hist.data$breaks[-1]
hist.table <- data.frame(
Interval = paste(interval.lower, "-", interval.upper),
Midpoint = hist.data$mids,
Frequency = hist.data$counts,
Rel.Frequency = hist.data$counts / sum(hist.data$counts)
)
return(hist.table)
}
ramen.hist <- hist(ramen.prices, breaks = 5, right = FALSE, plot = FALSE)
DistributionTable(ramen.hist)
| Interval | Midpoint | Frequency | Rel.Frequency |
|---|---|---|---|
| 500 - 600 | 550 | 4 | 0.08 |
| 600 - 700 | 650 | 13 | 0.26 |
| 700 - 800 | 750 | 18 | 0.36 |
| 800 - 900 | 850 | 12 | 0.24 |
| 900 - 1000 | 950 | 3 | 0.06 |
Page 39: Histogram of ramen prices with the ggplot function from the ggplot2 package. The resulting plot was tweaked to look like the histogram in the book.
ggplot(NULL, aes(x = ramen.prices)) + theme_bw() +
geom_histogram(binwidth = 100, colour = "black", fill = "gray") +
ggtitle("Histogram of 50 best ramen shops prices") +
xlab("Ramen prices") + ylab("Frequency") +
coord_cartesian(xlim = c(490, 1010), ylim = c(0, 20)) +
scale_x_continuous(breaks = seq(550, 950, 100))
Page 41: Data frame containing scores of three bowling teams:
bowling <- data.frame(
team.A = c(86, 73, 124, 111, 90, 38),
team.B = c(84, 71, 103, 85, 90, 89),
team.C = c(229, 77, 59, 95, 70, 88)
)
bowling
| team.A | team.B | team.C |
|---|---|---|
| 86 | 84 | 229 |
| 73 | 71 | 77 |
| 124 | 103 | 59 |
| 111 | 85 | 95 |
| 90 | 90 | 70 |
| 38 | 89 | 88 |
Page 42: Average scores can be calculated with the built-in mean function, and the summarise_each function from the dplyr package can apply any function to all columns at once:
bowling %>% summarise_each(funs(mean))
| team.A | team.B | team.C |
|---|---|---|
| 87 | 87 | 103 |
Page 43: There is no built-in function for the geometric mean, but it’s simple to write a special function for it:
GeometricMean <- function(values) {
gmean <- prod(values) ^ (1 / length(values))
return(gmean)
}
bowling %>% summarise_each(funs(GeometricMean)) %>% round(3)
| team.A | team.B | team.C |
|---|---|---|
| 81.614 | 86.478 | 92.062 |
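Note that prod can overflow for long vectors of large values; a numerically safer variant (my own sketch, not from the book) uses the identity \(\sqrt[n]{x_1 \cdots x_n} = \exp\left(\frac{1}{n}\sum_{i=1}^n \ln x_i\right)\), valid for positive values:

GeometricMeanLog <- function(values) {
  # equivalent to prod(values) ^ (1 / length(values)) for positive inputs
  exp(mean(log(values)))
}
bowling %>% summarise_each(funs(GeometricMeanLog)) %>% round(3)

This reproduces the table above.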
Page 43: Harmonic mean of bowling team scores:
HarmonicMean <- function(values) {
hmean <- 1 / (sum(1 / values) / length(values))
return(hmean)
}
bowling %>% summarise_each(funs(HarmonicMean)) %>% round(3)
| team.A | team.B | team.C |
|---|---|---|
| 75.163 | 85.948 | 85.132 |
Page 45: Median scores for the bowling teams (there is a built-in median function for that):
bowling %>% summarise_each(funs(median))
| team.A | team.B | team.C |
|---|---|---|
| 88 | 87 | 82.5 |
Page 50: Standard deviation (\(\sigma\)). There is a built-in function sd that computes the sample standard deviation \(s = \sqrt{\frac{1}{N-1} \sum_{i=1}^N (x_i-\overline{x})^2}\) (see page 52), but we can write a custom function that calculates the standard deviation of the entire population, \(\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^N (x_i-\overline{x})^2}\):
StdDev <- function(values) {
m <- mean(values)
stddev <- sqrt(sum((values - m) ^ 2) / length(values))
return(stddev)
}
bowling %>% summarise_each(funs(StdDev)) %>% round(1)
| team.A | team.B | team.C |
|---|---|---|
| 27.5 | 9.5 | 57.5 |
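As a quick sanity check (my own addition, not from the book), the built-in sd and our StdDev differ only by the factor \(\sqrt{N/(N-1)}\), so both lines below print the same value:

n <- length(bowling$team.A)
sd(bowling$team.A)                          # sample version, n - 1 denominator
StdDev(bowling$team.A) * sqrt(n / (n - 1))  # population version, rescaled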
Page 55: Sturges’ Rule (using the built-in nclass.Sturges function) and the break range of a histogram:
cat("Using Sturges' Rule on ramen.prices vector:\n")
nclass.Sturges(ramen.prices) %>% paste("Number of classes:", ., "\n") %>% cat()
BreakRange <- function(values) {
return(ceiling((max(values) - min(values)) / nclass.Sturges(values)))
}
BreakRange(ramen.prices) %>% paste("Range of class:", ., "\n") %>% cat()
Using Sturges' Rule on ramen.prices vector:
Number of classes: 7
Range of class: 69
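For reference, nclass.Sturges implements \(\lceil \log_2 n + 1 \rceil\), so the class count can be checked by hand:

ceiling(log2(length(ramen.prices)) + 1)  # 7, matching nclass.Sturges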
Page 56: Frequency table of the 50 best ramen shops (Sturges’ Rule):
# vector of breakpoints
ramen.breaks <- seq(min(ramen.prices), by = BreakRange(ramen.prices),
length.out = nclass.Sturges(ramen.prices) + 1)
ramen.breaks
[1] 500 569 638 707 776 845 914 983
ramen.histS <- hist(ramen.prices, breaks = ramen.breaks, right = FALSE, plot = FALSE)
DistributionTable(ramen.histS)
| Interval | Midpoint | Frequency | Rel.Frequency |
|---|---|---|---|
| 500 - 569 | 534.5 | 2 | 0.04 |
| 569 - 638 | 603.5 | 5 | 0.10 |
| 638 - 707 | 672.5 | 15 | 0.30 |
| 707 - 776 | 741.5 | 6 | 0.12 |
| 776 - 845 | 810.5 | 10 | 0.20 |
| 845 - 914 | 879.5 | 10 | 0.20 |
| 914 - 983 | 948.5 | 2 | 0.04 |
Page 57: High school girls’ 100m track race results:
race.times <- c(16.3, 22.4, 18.5, 18.7, 20.1)
mean(race.times) %>% paste("Average:", ., "\n") %>% cat()
median(race.times) %>% paste("Median:", ., "\n") %>% cat()
StdDev(race.times) %>% round(2) %>% paste("Standard Deviation:", ., "\n") %>% cat()
Average: 19.2
Median: 18.7
Standard Deviation: 2.01
Page 61: Results of the “Do you like or dislike the new uniform design?” survey:
school.uniform.survey <- factor(
c("LIKE", "NEITHER", "LIKE", "NEITHER", "DISLIKE", "LIKE", "LIKE", "LIKE", "LIKE", "LIKE",
"LIKE", "LIKE", "NEITHER", "LIKE", "LIKE", "NEITHER", "LIKE", "LIKE", "LIKE", "LIKE",
"LIKE", "LIKE", "DISLIKE", "NEITHER", "LIKE", "LIKE", "DISLIKE", "LIKE", "LIKE", "LIKE",
"NEITHER", "NEITHER", "LIKE", "DISLIKE", "LIKE", "LIKE", "LIKE", "LIKE", "NEITHER", "LIKE"))
Page 62: A frequency table can be obtained with the table function. Sum totals can be added with the addmargins function:
school.uniform.table <- table(school.uniform.survey)
school.uniform.table %>% addmargins()
| DISLIKE | LIKE | NEITHER | Sum |
|---|---|---|---|
| 4 | 28 | 8 | 40 |
Percent Table:
school.uniform.ptable <- prop.table(school.uniform.table) * 100
school.uniform.ptable %>% addmargins()
| DISLIKE | LIKE | NEITHER | Sum |
|---|---|---|---|
| 10 | 70 | 20 | 100 |
Let’s write a function to make a table like the one on page 62:
CrossTab <- function(fact){
tab <- addmargins(table(fact))
ptab <- addmargins(prop.table(table(fact)))
ctab <- data.frame(Response = unlist(unname(dimnames(tab))),
Frequency = as.vector(tab),
Percent = as.vector(ptab) * 100)
return(ctab)
}
CrossTab(school.uniform.survey)
| Response | Frequency | Percent |
|---|---|---|
| DISLIKE | 4 | 10 |
| LIKE | 28 | 70 |
| NEITHER | 8 | 20 |
| Sum | 40 | 100 |
Graphical representation of a categorical variable:
ggplot(NULL, aes(x = school.uniform.survey)) + theme_bw() +
geom_bar(colour = "black", fill = "gray") +
ggtitle("Results of school uniform survey") +
xlab("Answer") + ylab("Count") +
coord_cartesian(ylim = c(0, 30))
Page 64: Do you expect party A to win or lose against party B?
party.victory <- factor(c("Lose", "Lose", "Lose", "I don't know", "Win", "Lose", "Win", "I don't know", "Lose", "Lose"))
CrossTab(party.victory)
| Response | Frequency | Percent |
|---|---|---|
| I don't know | 2 | 20 |
| Lose | 6 | 60 |
| Win | 2 | 20 |
| Sum | 10 | 100 |
Pages 67-69: Test scores for four school subjects:
test.scores <- data.frame(
Student = c("Rui", "Yumi", LETTERS[1:16]),
English = c(90, 81, 73, 97, 85, 60, 74, 64, 72, 67, 87, 78, 85, 96, 77, 100, 92, 86),
Japanese = c(71, 90, 79, 70, 67, 66, 60, 83, 57, 85, 93, 89, 78, 74, 65, 78, 53, 80),
History = c(73, 61, 14, 41, 49, 87, 69, 65, 36, 7, 53, 100, 57, 45, 56, 34, 37, 70),
Biology = c(59, 73, 47, 38, 63, 56, 15, 53, 80, 50, 41, 62, 44, 26, 91, 35, 53, 68)
)
test.scores
| Student | English | Japanese | History | Biology |
|---|---|---|---|---|
| Rui | 90 | 71 | 73 | 59 |
| Yumi | 81 | 90 | 61 | 73 |
| A | 73 | 79 | 14 | 47 |
| B | 97 | 70 | 41 | 38 |
| C | 85 | 67 | 49 | 63 |
| D | 60 | 66 | 87 | 56 |
| E | 74 | 60 | 69 | 15 |
| F | 64 | 83 | 65 | 53 |
| G | 72 | 57 | 36 | 80 |
| H | 67 | 85 | 7 | 50 |
| I | 87 | 93 | 53 | 41 |
| J | 78 | 89 | 100 | 62 |
| K | 85 | 78 | 57 | 44 |
| L | 96 | 74 | 45 | 26 |
| M | 77 | 65 | 56 | 91 |
| N | 100 | 78 | 34 | 35 |
| O | 92 | 53 | 37 | 53 |
| P | 86 | 80 | 70 | 68 |
Page 68: Average test scores by subject can be calculated using the summarise_each function, dropping the first column with student names via the -Student argument:
# number of significant digits to print
options(digits = 3)
test.scores %>% summarise_each(funs(mean), -Student)
| English | Japanese | History | Biology |
|---|---|---|---|
| 81.3 | 74.3 | 53 | 53 |
Page 72: Normalization of a vector can be done with the built-in scale function, but by default it uses the sample standard deviation (with an \(n-1\) denominator), so we must change the scale parameter:
Normalize <- function(values) {
norm.v <- scale(values, scale = StdDev(values))
return(norm.v)
}
options(digits = 2)
# Normalize each column except Student
test.scores %>% mutate_each(funs(Normalize), -Student)
| Student | English | Japanese | History | Biology |
|---|---|---|---|---|
| Rui | 0.77 | -0.30 | 0.88 | 0.33 |
| Yumi | -0.03 | 1.39 | 0.35 | 1.09 |
| A | -0.74 | 0.41 | -1.71 | -0.33 |
| B | 1.39 | -0.39 | -0.53 | -0.82 |
| C | 0.33 | -0.65 | -0.18 | 0.55 |
| D | -1.90 | -0.74 | 1.49 | 0.16 |
| E | -0.65 | -1.27 | 0.70 | -2.08 |
| F | -1.54 | 0.77 | 0.53 | 0.00 |
| G | -0.83 | -1.54 | -0.75 | 1.48 |
| H | -1.27 | 0.95 | -2.02 | -0.16 |
| I | 0.50 | 1.66 | 0.00 | -0.66 |
| J | -0.30 | 1.30 | 2.07 | 0.49 |
| K | 0.33 | 0.33 | 0.18 | -0.49 |
| L | 1.30 | -0.03 | -0.35 | -1.48 |
| M | -0.39 | -0.83 | 0.13 | 2.08 |
| N | 1.66 | 0.33 | -0.84 | -0.98 |
| O | 0.95 | -1.90 | -0.70 | 0.00 |
| P | 0.41 | 0.50 | 0.75 | 0.82 |
Page 74: Deviation score (the standard score rescaled to a mean of 50 and a standard deviation of 10):
DeviationScore <- function(values){
return(Normalize(values) * 10 + 50)
}
options(digits = 3)
test.scores %>% mutate_each(funs(DeviationScore), -Student)
| Student | English | Japanese | History | Biology |
|---|---|---|---|---|
| Rui | 57.7 | 47.0 | 58.8 | 53.3 |
| Yumi | 49.7 | 63.9 | 53.5 | 60.9 |
| A | 42.6 | 54.1 | 32.9 | 46.7 |
| B | 63.9 | 46.1 | 44.7 | 41.8 |
| C | 53.3 | 43.5 | 48.2 | 55.5 |
| D | 31.0 | 42.6 | 65.0 | 51.6 |
| E | 43.5 | 37.3 | 57.0 | 29.2 |
| F | 34.6 | 57.7 | 55.3 | 50.0 |
| G | 41.7 | 34.6 | 42.5 | 64.8 |
| H | 37.3 | 59.5 | 29.8 | 48.4 |
| I | 55.0 | 66.6 | 50.0 | 43.4 |
| J | 47.0 | 63.0 | 70.7 | 54.9 |
| K | 53.3 | 53.3 | 51.8 | 45.1 |
| L | 63.0 | 49.7 | 46.5 | 35.2 |
| M | 46.1 | 41.7 | 51.3 | 70.8 |
| N | 66.6 | 53.3 | 41.6 | 40.2 |
| O | 59.5 | 31.0 | 43.0 | 50.0 |
| P | 54.1 | 55.0 | 57.5 | 58.2 |
# reset to the default value
options(digits = 7)
Page 78: Mean and standard deviation of the standard scores of the girls’ 100m track race times:
mean(Normalize(race.times)) %>% round(4) %>% paste("Mean:", ., "\n") %>% cat()
Mean: 0
StdDev(Normalize(race.times)) %>% paste("Standard Deviation:", ., "\n") %>% cat()
Standard Deviation: 1
Page 93: What is the area under the standard normal curve from \(0\) to \(1.96\)? The table in the book ranges from \(0\) to \(x\), but the pnorm function in R uses the range from \(-\infty\) to \(x\), so we subtract pnorm(0) from pnorm(1.96) to get the correct result:
(pnorm(1.96) - pnorm(0)) %>% round(4)
[1] 0.475
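Since the standard normal distribution is symmetric about zero, pnorm(0) is exactly 0.5, so an equivalent expression is:

(pnorm(1.96) - 0.5) %>% round(4)

[1] 0.475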
Page 97: What is the area under the standard normal curve from \(1.8\) to \(\infty\)? To integrate from \(x\) to \(\infty\), we can use the lower.tail = FALSE parameter.
pnorm(1.8, lower.tail = FALSE) %>% round(4)
[1] 0.0359
Page 104: What is the \(\chi^2\) value when the probability is \(0.05\) and there is \(1\) degree of freedom? By default, the qchisq function returns the \(q\) for which \(P(X < q) = p\), while the table in the book assumes \(P(X \geq q) = p\). To adjust, add the lower.tail = FALSE parameter:
qchisq(0.05, df = 1, lower.tail = FALSE) %>% round(4)
[1] 3.8415
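We can verify the result with pchisq, the inverse of qchisq: the upper-tail probability at this quantile should be 0.05:

pchisq(3.8415, df = 1, lower.tail = FALSE) %>% round(4)

[1] 0.05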
Page 108: What is the area under the standard normal curve from \(-\infty\) to \(-0.29\)?
pnorm(-0.29) %>% round(4)
[1] 0.3859
Page 108: What is the \(\chi^2\) value when the probability is \(0.05\) and \(2\) degrees of freedom?
qchisq(0.05, df = 2, lower.tail = FALSE) %>% round(4)
[1] 5.9915
Page 116: Monthly expenditures on makeup and clothes:
street.survey <- data.frame(
Respondent = c(paste("Ms.", LETTERS[1:10])),
Makeup = c(3000, 5000, 12000, 2000, 7000, 15000, 5000, 6000, 8000, 10000),
Clothes = c(7000, 8000, 25000, 5000, 12000, 30000, 10000, 15000, 20000, 18000)
)
street.survey
| Respondent | Makeup | Clothes |
|---|---|---|
| Ms. A | 3000 | 7000 |
| Ms. B | 5000 | 8000 |
| Ms. C | 12000 | 25000 |
| Ms. D | 2000 | 5000 |
| Ms. E | 7000 | 12000 |
| Ms. F | 15000 | 30000 |
| Ms. G | 5000 | 10000 |
| Ms. H | 6000 | 15000 |
| Ms. I | 8000 | 20000 |
| Ms. J | 10000 | 18000 |
Scatterplot of the survey results using the ggplot2 package:
ggplot(street.survey, aes(Makeup, Clothes)) + geom_point(size=3) + theme_bw() +
ggtitle("Scatter chart of monthly expenditures on makeup and clothes") +
xlab("Amount spent on makeup (¥)") + ylab("Amount spent on clothes (¥)") +
coord_cartesian(xlim = c(0, 30000), ylim = c(0, 32000))
Page 118: Correlation coefficient between expenditures on makeup and expenditures on clothes:
street.survey %$% cor(Makeup, Clothes) %>% round(4)
[1] 0.968
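The same value follows from the definition \(r = \frac{\mathrm{cov}(x, y)}{\sigma_x \sigma_y}\); here is a check (my own addition) using the built-in cov and sd functions, whose \(n-1\) denominators cancel:

with(street.survey, cov(Makeup, Clothes) / (sd(Makeup) * sd(Clothes))) %>% round(4)

[1] 0.968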
Page 121: Age and favorite fashion brand:
brand.survey <- data.frame(
Respondent = c(paste("Ms.", LETTERS[1:15])),
Age = c(27, 33, 16, 29, 32, 23, 25, 28, 22, 18, 26, 26, 15, 29, 26),
Brand = as.factor(c("Theremes", "Channelior", "Bureperry", "Bureperry", "Channelior",
"Theremes", "Channelior", "Theremes", "Bureperry", "Bureperry",
"Channelior", "Theremes", "Bureperry", "Channelior", "Bureperry"))
)
brand.survey
| Respondent | Age | Brand |
|---|---|---|
| Ms. A | 27 | Theremes |
| Ms. B | 33 | Channelior |
| Ms. C | 16 | Bureperry |
| Ms. D | 29 | Bureperry |
| Ms. E | 32 | Channelior |
| Ms. F | 23 | Theremes |
| Ms. G | 25 | Channelior |
| Ms. H | 28 | Theremes |
| Ms. I | 22 | Bureperry |
| Ms. J | 18 | Bureperry |
| Ms. K | 26 | Channelior |
| Ms. L | 26 | Theremes |
| Ms. M | 15 | Bureperry |
| Ms. N | 29 | Channelior |
| Ms. O | 26 | Bureperry |
Page 122: Scatterplot of the girls’ “favorite fashion brand and age” survey:
ggplot(brand.survey, aes(Brand, Age)) + geom_point(size=3) +
theme_bw() + ggtitle("Scatter chart of favorite fashion brand and age")
Pages 123-124: What is the correlation ratio between a girl’s age and her favorite fashion brand? There is no built-in function for the correlation ratio, so we must do all the calculations ourselves. Here I use the popular dplyr package, which is very handy for various data manipulations:
# Calculating intraclass variance
intraclass.var <- brand.survey %>% group_by(Brand) %>%
summarise(S = sum((Age - mean(Age))^2)) %>%
summarise(sum(S)) %>% as.numeric()
# Average for all data
age.mean <- brand.survey %>% summarise(mean(Age)) %>% as.numeric()
# Calculating interclass variance
interclass.var <- brand.survey %>% group_by(Brand) %>%
summarise(S = n()*(mean(Age) - age.mean)^2) %>%
summarise(sum(S)) %>% as.numeric()
cor.ratio <- interclass.var / (intraclass.var + interclass.var)
intraclass.var %>% round(4) %>% paste("Intraclass variance:", ., "\n") %>% cat()
interclass.var %>% round(4) %>% paste("Interclass variance:", ., "\n") %>% cat()
cor.ratio %>% round(4) %>% paste("Correlation Ratio:", ., "\n") %>% cat()
Intraclass variance: 224
Interclass variance: 180
Correlation Ratio: 0.4455
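As a consistency check (my own addition), the intraclass and interclass variances should add up to the total sum of squared deviations of Age, i.e. \(224 + 180 = 404\):

brand.survey %$% sum((Age - mean(Age))^2)

[1] 404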
Page 128: Cross tabulation of sex and desired way of being asked out:
asked.out.table <- matrix(c(34, 61, 53, 38, 40, 74), byrow = TRUE, nrow = 2)
dimnames(asked.out.table) = list(Sex=c("Female", "Male"),
DesiredWay = c("Phone", "E-mail", "Face to face"))
asked.out.table %>% addmargins()
| | Phone | E-mail | Face to face | Sum |
|---|---|---|---|---|
| Female | 34 | 61 | 53 | 148 |
| Male | 38 | 40 | 74 | 152 |
| Sum | 72 | 101 | 127 | 300 |
Page 132: Calculate Pearson’s \(\chi^2_0\) test statistic using the built-in chisq.test function:
asked.out.chisq <- chisq.test(asked.out.table)$statistic
asked.out.chisq %>% round(4)
X-squared
8.0091
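Under the hood, chisq.test compares the observed counts with the expected counts under independence, \(E_{ij} = (\text{row}_i \text{ total}) \times (\text{column}_j \text{ total}) / n\); here is a manual sketch of the same statistic (my own addition):

# expected frequencies under the independence hypothesis
expected <- outer(rowSums(asked.out.table), colSums(asked.out.table)) / sum(asked.out.table)
# Pearson's chi-squared statistic: sum of (O - E)^2 / E
sum((asked.out.table - expected)^2 / expected) %>% round(4)

[1] 8.0091

The values match exactly because chisq.test applies Yates’ continuity correction only to 2x2 tables, and this one is 2x3.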
Page 132: Calculate Cramer’s coefficient, defined as \(\sqrt{\chi^2_0 / (n(\min(r, c) - 1))}\), where \(n\) is the grand total and \(r\) and \(c\) are the numbers of rows and columns:
Cramers <- function(ctable) {
chi2 <- as.numeric(chisq.test(ctable)$statistic)
s <- sum(ctable)
d <- min(dim(ctable))
return(sqrt(chi2 / (s * (d - 1))))
}
Cramers(asked.out.table) %>% round(4)
[1] 0.1634
Page 138: What is Cramer’s coefficient for the cross tabulation of ordered food type and tea or coffee preference?
drink.table <- matrix(c(43, 33, 51, 53, 29, 41), byrow = TRUE, nrow = 3)
dimnames(drink.table) = list(FoodType=c("Japanese", "European", "Chinese"),
Preference = c("Coffee", "Tea"))
drink.table %>% addmargins()
| | Coffee | Tea | Sum |
|---|---|---|---|
| Japanese | 43 | 33 | 76 |
| European | 51 | 53 | 104 |
| Chinese | 29 | 41 | 70 |
| Sum | 123 | 127 | 250 |
Page 141: Calculate Pearson’s \(\chi^2_0\) test statistic:
chisq.test(drink.table)$statistic %>% round(4)
X-squared
3.3483
Page 141: Calculate Cramer’s coefficient:
Cramers(drink.table) %>% round(4)
[1] 0.1157
Let’s carry out the experiment with Pearson’s chi-square statistic (\(\chi^2_0\)) and the chi-square distribution described on pages 152-153:
Page 153: Contingency table of sex and desired way of being asked out for “all high school students residing in Japan”:
asked.out.table.all <- as.table(matrix(c(400, 1600, 2000, 600, 2400, 3000),
                                        byrow = TRUE, nrow = 2))
dimnames(asked.out.table.all) = list(Sex=c("Female", "Male"),
DesiredWay = c("Phone", "E-mail", "Face to face"))
asked.out.table.all %>% addmargins()
| | Phone | E-mail | Face to face | Sum |
|---|---|---|---|---|
| Female | 400 | 1600 | 2000 | 4000 |
| Male | 600 | 2400 | 3000 | 6000 |
| Sum | 1000 | 4000 | 5000 | 10000 |
We’ll make a data frame with the frequencies of all combinations of the levels of the two factors:
asked.out.freq <- as.data.frame(asked.out.table.all)
asked.out.freq
| Sex | DesiredWay | Freq |
|---|---|---|
| Female | Phone | 400 |
| Male | Phone | 600 |
| Female | E-mail | 1600 |
| Male | E-mail | 2400 |
| Female | Face to face | 2000 |
| Male | Face to face | 3000 |
To make a raw data table (like table 7-1 on page 153) from this frequency table, we repeat each row Freq times:
FreqToCases <- function(row) {
n <- row$Freq
df <- data.frame(
Sex = rep(row$Sex, n),
DesiredWay = rep(row$DesiredWay, n)
)
return(df)
}
# Call FreqToCases function for every row of asked.out.freq and
# glue all results in a single data frame
asked.out.data <- asked.out.freq %>% rowwise() %>% do(FreqToCases(.))
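An equivalent base R alternative (a sketch, no dplyr needed) replicates the row indices with rep:

# repeat row i of the two factor columns Freq[i] times
asked.out.data2 <- asked.out.freq[rep(seq_len(nrow(asked.out.freq)), asked.out.freq$Freq),
                                  c("Sex", "DesiredWay")]
nrow(asked.out.data2)  # 10000 rows, one per student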
Then we will draw 10000 different random samples of 300 high school students each and calculate the \(\chi^2_0\) statistic for each sample.
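Because sample_n draws rows at random, the statistics below will vary from run to run; fixing the random seed first makes the experiment reproducible (the seed value 1 is arbitrary, my own addition):

set.seed(1)  # any fixed seed makes the random sampling reproducible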
chisq.data <- data.frame(Sample = 1:10000)
SampleChiSq <- function(n) {
# random sample of n students
asked.out.sample <- sample_n(asked.out.data, n)
# contingency table for this sample
asked.out.sample.table <- table(asked.out.sample)
# return Chi-squared statistic for this sample
return(chisq.test(asked.out.sample.table)$statistic)
}
# Vectorize a function to be able to make all random samples at once
SampleChiSq %<>% Vectorize()
# Calculating 10000 different Chi-squared statistics
chisq.data$ChiSq <- SampleChiSq(rep(300,10000))
# View first 10 rows of chisq.data
chisq.data %>% head(10) %>% round(5)
| Sample | ChiSq |
|---|---|
| 1 | 0.70046 |
| 2 | 2.46699 |
| 3 | 4.70145 |
| 4 | 0.83618 |
| 5 | 2.85338 |
| 6 | 0.77407 |
| 7 | 2.19792 |
| 8 | 0.04066 |
| 9 | 0.52391 |
| 10 | 0.90191 |
Page 154: Histogram of 10000 \(\chi^2_0\) statistics from 10000 random samples:
# plot title with mathematical annotation
title <- expression(paste("Histogram of ", chi[0]^2, " statistics from 10000 samples"))
# plot using ggplot2 package
chi.plot1 <- ggplot(chisq.data, aes(x = ChiSq)) + theme_bw() +
geom_histogram(aes(y = ..density..), binwidth = 0.5, colour = "black", fill = "gray") +
xlab(expression(paste(chi[0]^2, " value"))) + ylab("Density of distribution") +
ggtitle(title) + coord_cartesian(xlim = c(0, 15), ylim = c(0, 0.52))
chi.plot1
Page 158: Critical value \(\chi^2_{crit}\) for \(0.05\) significance level:
chi.crit <- qchisq(0.05, df = 2, lower.tail = FALSE)
chi.crit
[1] 5.991465
Page 159: We can add a graph of \(\chi^2\)-distribution to the histogram and show the critical region:
# title with mathematical annotation
title <- expression(atop(paste("Histogram of ", chi[0]^2, " statistics from 10000 samples and "),
paste("a graph of ", chi^2, " distribution with 2 degrees of freedom")))
# add some elements to chi.plot1
chi.plot2 <- chi.plot1 + ggtitle(title) +
stat_function(fun = dchisq, args = list(df=2), colour = "blue") +
geom_vline(aes(xintercept = chi.crit), colour = "red", linetype = "dashed") +
annotate("text", label = "{chi[crit]^2}(alpha == 0.05)", parse = TRUE, x = chi.crit + 1.9,
y = 0.07, color = "red", size = 6)
chi.plot2
The histogram looks very similar to the \(\chi^2\) distribution, because Cramer’s coefficient for the population “all high school students residing in Japan” is equal to zero:
Cramers(asked.out.table.all) %>% cat()
0
Let’s see how many samples have a \(\chi^2_0\) statistic equal to or greater than \(\chi^2_{crit}\):
chisq.data %>% filter(ChiSq >= chi.crit) %>% nrow()
[1] 450
And that’s 4.5% (nearly 5%) of all 10000 samples, as expected.
Page 177: P-value:
chisq.test(asked.out.table)$p.value %>% round(4)
[1] 0.0182
Or you can get the full summary with:
chisq.test(asked.out.table)
Pearson's Chi-squared test
data: asked.out.table
X-squared = 8.0091, df = 2, p-value = 0.01823
Page 188: \(\chi^2\) test of independence of the table of ordered food and tea and coffee preferences:
chisq.test(drink.table)
Pearson's Chi-squared test
data: drink.table
X-squared = 3.3483, df = 2, p-value = 0.1875
qchisq(0.01, df = 2, lower.tail = FALSE) %>% round(4) %>%
paste("Chi-squared critical value:", ., "\n") %>% cat()
Chi-squared critical value: 9.2103
Since \(\chi^2_0 = 3.3483\) is below the critical value (and the p-value of \(0.1875\) is above the \(0.01\) significance level), we cannot reject the null hypothesis that food type and drink preference are independent.