In this presentation, we will cover pages 28-38 of Thorndike and Thorndike-Christ (2009).
First, load the necessary libraries and data.
# A tibble: 52 × 7
first last gender class reading spelling math
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Aaron Andrews male Johnson 32 64 43
2 Byron Biggs male Johnson 40 64 37
3 Charles Cowen male Johnson 36 40 38
4 Donna Davis female Johnson 41 74 40
5 Erin Edwards female Johnson 36 69 28
6 Fernando Franco male Johnson 41 67 42
7 Gail Galaraga female Johnson 40 71 37
8 Harpo Henry male Johnson 30 51 34
9 Irrida Ignacio female Johnson 37 68 35
10 Jack Johanson male Johnson 26 56 26
# ℹ 42 more rows
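The setup code itself is not shown above, so the following is only a sketch of what it might look like. The package choice is an assumption based on the functions used later in these slides, and `students_head` is a hypothetical stand-in holding just the ten rows printed above, not the full 52-row data set.

```r
# Assumption: these packages cover the functions used later in the slides
library(dplyr)     # mutate()
library(ggplot2)   # ggplot(), geom_histogram(), stat_ecdf()

# Hypothetical stand-in: the first ten rows exactly as printed above
students_head <- tibble::tibble(
  first    = c("Aaron", "Byron", "Charles", "Donna", "Erin",
               "Fernando", "Gail", "Harpo", "Irrida", "Jack"),
  last     = c("Andrews", "Biggs", "Cowen", "Davis", "Edwards",
               "Franco", "Galaraga", "Henry", "Ignacio", "Johanson"),
  gender   = c("male", "male", "male", "female", "female",
               "male", "female", "male", "female", "male"),
  class    = rep("Johnson", 10),
  reading  = c(32, 40, 36, 41, 36, 41, 40, 30, 37, 26),
  spelling = c(64, 64, 40, 74, 69, 67, 71, 51, 68, 56),
  math     = c(43, 37, 38, 40, 28, 42, 37, 34, 35, 26)
)
students_head
```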
Look at the scores in the Math column and consider how they can be rearranged to give a clearer picture of how the pupils have performed on the math test. The simplest rearrangement is merely to list the scores in order from highest to lowest, as shown in Table 2-2 on page 29.
How do we do this in R?
First, display the frequency of math scores.
Note: 31 levels
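A minimal sketch of this step. So that it runs on its own, the vector `math` is rebuilt from the frequencies printed in Table 2-2 rather than read from the original data; in the slides it would be `students$math`. The names `freq`, `math`, and `math_counts` are illustrative only.

```r
# Frequencies for scores 60 down to 19, copied from Table 2-2
freq <- c(1, 0, 0, 0, 0, 0, 0, 2, 1, 1, 1, 2, 1, 1, 0, 2, 2, 3, 1, 1, 2,
          1, 5, 4, 1, 1, 3, 5, 1, 2, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1)
math <- rep(60:19, times = freq)   # stand-in for students$math

sort(math, decreasing = TRUE)      # scores listed from highest to lowest
math_counts <- table(math)         # counts only the scores that actually occur
math_counts
length(math_counts)                # number of distinct scores (levels)
```

Because `table()` only counts values that are present, scores that nobody obtained do not appear in this output at all.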
We also want to show the scores that have a frequency of zero. (For example, no one scored 59 on the math test.)
# Define 42 levels between 19 & 60
table.2.2 <- factor(students$math, levels = c(60:19))
# Show the table vertically
table2.2 <- cbind(Freq = table(table.2.2))
# Make this a data frame (cbind() returned a one-column matrix)
table2.2 <- as.data.frame(table2.2)
# Add a first column, ScoreX, using the row names (60 down to 19)
table2.2 <- table2.2 |>
  mutate(ScoreX = row.names(table2.2), .before = 1)
Frequency Distribution of Scores on the Mathematics Test for 52 Students
ScoreX Freq
60 60 1
59 59 0
58 58 0
57 57 0
56 56 0
55 55 0
54 54 0
53 53 2
52 52 1
51 51 1
50 50 1
49 49 2
48 48 1
47 47 1
46 46 0
45 45 2
44 44 2
43 43 3
42 42 1
41 41 1
40 40 2
39 39 1
38 38 5
37 37 4
36 36 1
35 35 1
34 34 3
33 33 5
32 32 1
31 31 2
30 30 0
29 29 1
28 28 1
27 27 0
26 26 1
25 25 1
24 24 1
23 23 0
22 22 1
21 21 1
20 20 0
19 19 1
Scores are often grouped into broader categories to make the presentation clearer. We give up some detail in the data so that the overall pattern of the whole set of scores is easier to grasp. In our example, we group three adjacent scores, so each grouping interval spans three score points. The entire range of scores from 19 to 60 is represented by 14 intervals, each of which includes three scores.
Here is one way to reproduce the grouped table shown on page 31.
# create a vector called "bins", counting by threes
bins <- seq(17, 62, by=3)
# NOTE: Starting at 17 lines up with Thorndike: (17,20] holds the scores 18, 19, 20.
# These breaks give 15 intervals, from (17,20] up to (59,62].
# Then create a vector called "Interval"
Interval <- cut(students$math, bins)
# cut() divides the range of students$math into intervals and codes each score according to the interval it falls in.
table(Interval) # Produces a horizontal table
Interval
(17,20] (20,23] (23,26] (26,29] (29,32] (32,35] (35,38] (38,41] (41,44] (44,47]
1 2 3 2 3 9 10 4 6 3
(47,50] (50,53] (53,56] (56,59] (59,62]
4 4 0 0 1
Grouped Frequency Distribution of Scores from 52 Students on a Math Test Using an Interval of 3
# transform() makes a vertical table like Table.2.3
table.2.3 <- transform(table(Interval))
table.2.3
Interval Freq
1 (17,20] 1
2 (20,23] 2
3 (23,26] 3
4 (26,29] 2
5 (29,32] 3
6 (32,35] 9
7 (35,38] 10
8 (38,41] 4
9 (41,44] 6
10 (44,47] 3
11 (47,50] 4
12 (50,53] 4
13 (53,56] 0
14 (56,59] 0
15 (59,62] 1
The same table can also be shown in descending order, with the interval containing 60 at the top.
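One way to get the descending layout, sketched in base R. To keep the example self-contained, the table is rebuilt from the frequencies printed in Table 2-3 instead of from `students$math`; `labels` and `table.2.3.desc` are hypothetical names, and this is not necessarily the code used in the original slides.

```r
# Interval frequencies copied from Table 2-3, lowest interval first
freq <- c(1, 2, 3, 2, 3, 9, 10, 4, 6, 3, 4, 4, 0, 0, 1)
bins <- seq(17, 62, by = 3)

# Rebuild the interval labels in cut()'s default "(a,b]" style
labels <- paste0("(", head(bins, -1), ",", tail(bins, -1), "]")
table.2.3 <- data.frame(Interval = factor(labels, levels = labels),
                        Freq = freq)

# Reverse the row order so the highest interval comes first
table.2.3.desc <- table.2.3[rev(seq_len(nrow(table.2.3))), ]
rownames(table.2.3.desc) <- NULL   # renumber the rows 1..15
table.2.3.desc
```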
A cumulative frequency distribution is easily prepared from the frequency distribution or grouped frequency distribution, as shown in Table 2-4, which presents the cumulative frequency, as well as the frequency in each interval. Each entry in the column labeled “Cumulative Frequency” shows the total number of individuals having a score equal to or less than the highest score in that interval.
Note: There are more elegant ways to do this.
This code produces table.2.4, whose intervals run in ascending order from 17 to 62.
To display this table in descending order, see the code on the next slide.
Interval Freq Cumulative_Frequency Cumulative_Percent
1 (59,62] 1 52 100
2 (56,59] 0 51 98
3 (53,56] 0 51 98
4 (50,53] 4 51 98
5 (47,50] 4 47 90
6 (44,47] 3 43 83
7 (41,44] 6 40 77
8 (38,41] 4 34 65
9 (35,38] 10 30 58
10 (32,35] 9 20 38
11 (29,32] 3 11 21
12 (26,29] 2 8 15
13 (23,26] 3 6 12
14 (20,23] 2 3 6
15 (17,20] 1 1 2
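The code that built this table is not shown above, so here is a hedged reconstruction in base R that yields the same columns. It starts from the interval frequencies printed in Table 2-3; `cumsum()` gives the running total, and the percentages are rounded to whole numbers, which is an assumption that matches the printed values. `table.2.4.desc` is a hypothetical name.

```r
# Interval frequencies copied from Table 2-3, lowest interval first
freq <- c(1, 2, 3, 2, 3, 9, 10, 4, 6, 3, 4, 4, 0, 0, 1)
bins <- seq(17, 62, by = 3)
labels <- paste0("(", head(bins, -1), ",", tail(bins, -1), "]")

table.2.4 <- data.frame(Interval = labels, Freq = freq)
# Running total of scores at or below the top of each interval
table.2.4$Cumulative_Frequency <- cumsum(table.2.4$Freq)
# Same quantity as a percentage of all 52 scores (rounding is an assumption)
table.2.4$Cumulative_Percent <-
  round(100 * table.2.4$Cumulative_Frequency / sum(table.2.4$Freq))

# Descending display: highest interval first
table.2.4.desc <- table.2.4[nrow(table.2.4):1, ]
rownames(table.2.4.desc) <- NULL
table.2.4.desc
```

The cumulative columns must be computed while the table is still in ascending order; reversing the rows first would accumulate from the top score downward instead.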
Univariate data is data that consists of only one variable. The data can be displayed in a variety of ways, including histograms, cumulative frequency curves, and step curves.
ggplot(students, aes(x = math, y = after_stat(count))) +
geom_histogram(binwidth = 3, boundary = 17, # bin edges at 17, 20, ... to match the Table 2-3 intervals
               color = "black", fill = "grey") +
theme_classic() +
labs(x = "Mathematics test scores",
y = "Frequency") +
scale_x_continuous(breaks = 17 + c(0:15)*3) +
scale_y_continuous(breaks = 0 + c(0:6)*2) +
ggtitle("Figure 2-1\nHistogram of 52 mathematics scores")
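# Again, this is one way of approaching it. Not super elegant though.
# Threes is the upper limit of each interval (20, 23, ..., 62);
# this assumes table.2.4 is still in ascending order.
MathByThrees <- data.frame(table.2.4) |>
  mutate(Threes = 20 + c(0:14)*3)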
# Figure 2-3
ggplot(MathByThrees, aes(x=Threes, y=Cumulative_Frequency)) +
geom_line() +
geom_point() +
theme_classic() +
labs(x = "Math Score",
y = "Cumulative frequency") +
ggtitle("Figure 2-3\nCumulative frequency curve") +
scale_x_continuous(breaks = 18 + c(0:14)*3)
Maybe it doesn’t make sense to put the data into bins of 3 when we draw an ogive like Figure 2-3. Another way to approach this is to plot an empirical cumulative distribution function (ECDF), or step curve, which uses every individual score.
And one last detail: Let’s change the code so that we have a cumulative percentage on the y-axis.
# Figure 2-5
ggplot(students, aes(x = math)) +
stat_ecdf(geom = "step") +
# Produce the empirical cumulative distribution function
scale_y_continuous(labels = scales::percent) +
# Change the y-axis labels from proportions to percentages
theme_classic() +
labs(x = "Math Score",
y = "Cumulative percentage") +
ggtitle("Cumulative frequency (step curve)")
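To see the numbers behind the step curve, base R's `ecdf()` returns a function that gives the proportion of scores at or below any value. The sketch below rebuilds the scores from the Table 2-2 frequencies so that it runs on its own; `freq`, `math`, and `Fn` are illustrative names, and in the slides `math` would be `students$math`.

```r
# Frequencies for scores 60 down to 19, copied from Table 2-2
freq <- c(1, 0, 0, 0, 0, 0, 0, 2, 1, 1, 1, 2, 1, 1, 0, 2, 2, 3, 1, 1, 2,
          1, 5, 4, 1, 1, 3, 5, 1, 2, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1)
math <- rep(60:19, times = freq)   # stand-in for students$math

Fn <- ecdf(math)   # step function: proportion of scores <= x
Fn(38)             # 30 of the 52 scores are 38 or below

# Evaluated at the upper limit of each interval, the ECDF gives the
# cumulative percentages of the grouped table
upper <- seq(20, 62, by = 3)
round(100 * Fn(upper))
```

Unlike the ogive, the ECDF needs no choice of bin width, so nothing is lost to grouping.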