R4DS Thorndike pages 28 - 38

Hoffman

Thorndike 28-38

In this presentation, we will cover pages 28-38 of Thorndike and Thorndike-Christ (2009).

First, load the necessary libraries and data.

library(tidyverse)
library(googlesheets4)
gs4_deauth() # skip Google sign-in; the sheet is shared publicly, so no token is needed
students <- read_sheet("https://docs.google.com/spreadsheets/d/1hPYA-1X5RBlPzlH-tsdnXjM7wN0wq7FF3BmyigUJOPc/edit?usp=sharing")

The “students” dataframe

students
# A tibble: 52 × 7
   first    last     gender class   reading spelling  math
   <chr>    <chr>    <chr>  <chr>     <dbl>    <dbl> <dbl>
 1 Aaron    Andrews  male   Johnson      32       64    43
 2 Byron    Biggs    male   Johnson      40       64    37
 3 Charles  Cowen    male   Johnson      36       40    38
 4 Donna    Davis    female Johnson      41       74    40
 5 Erin     Edwards  female Johnson      36       69    28
 6 Fernando Franco   male   Johnson      41       67    42
 7 Gail     Galaraga female Johnson      40       71    37
 8 Harpo    Henry    male   Johnson      30       51    34
 9 Irrida   Ignacio  female Johnson      37       68    35
10 Jack     Johanson male   Johnson      26       56    26
# ℹ 42 more rows

Math scores in the dataframe

Look at the scores in the math column and consider how they can be rearranged to give a clearer picture of how the pupils performed on the math test. The simplest rearrangement is to list the scores in order from highest to lowest, as shown in Table 2-2 on page 29.

How do we do this in R?
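
One direct answer is sort(). The sketch below is only an illustration: so that it runs without the Google Sheet, it rebuilds the 52 math scores from the frequency counts printed later in this presentation (the vectors scores, counts, and math are assumptions made for the example, and the original row order is not preserved).

```r
# Rebuild the 52 math scores from the frequency output shown in this presentation
# (an assumption so the example runs without access to the Google Sheet).
scores <- c(19, 21, 22, 24, 25, 26, 28, 29, 31:45, 47:53, 60)
counts <- c(1, 1, 1, 1, 1, 1, 1, 1,
            2, 1, 5, 3, 1, 1, 4, 5, 1, 2, 1, 1, 3, 2, 2,
            1, 1, 2, 1, 1, 1, 2,
            1)
math <- rep(scores, times = counts)

# List the scores from highest to lowest
math_sorted <- sort(math, decreasing = TRUE)
head(math_sorted)
```

With the real data, the same idea would be sort(students$math, decreasing = TRUE), or arrange(students, desc(math)) to keep the other columns alongside the scores.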

Frequency Distribution - Math Scores

First, display the frequency of math scores.

table(students$math)

19 21 22 24 25 26 28 29 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 47 48 49 
 1  1  1  1  1  1  1  1  2  1  5  3  1  1  4  5  1  2  1  1  3  2  2  1  1  2 
50 51 52 53 60 
 1  1  1  2  1 

Display Table 2-2 vertically

Note: 31 levels

table2.2 <- cbind(Freq = table(students$math))
table2.2
   Freq
19    1
21    1
22    1
24    1
25    1
26    1
28    1
29    1
31    2
32    1
33    5
34    3
35    1
36    1
37    4
38    5
39    1
40    2
41    1
42    1
43    3
44    2
45    2
47    1
48    1
49    2
50    1
51    1
52    1
53    2
60    1

Fix the lack of zeros

We want to show that some math scores have a frequency of zero. (For example, no one scored 59 on the math test.)

# Define 42 levels, from 60 down to 19, so that unobserved scores are kept
math_levels <- factor(students$math, levels = 60:19)

# Show the table vertically
table2.2 <- cbind(Freq = table(math_levels))

# Make this a data frame (it was a one-column matrix)
table2.2 <- as.data.frame(table2.2)

# Add a first column, ScoreX, from the row names (60 down to 19)
table2.2 <- table2.2 |>
  mutate(ScoreX = row.names(table2.2), .before = 1)
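
For comparison, here is a more compact base-R sketch of the same idea. It is not the approach used in these slides, just an alternative: naming the argument inside table() labels the score column, and as.data.frame() produces the vertical layout in one step. The math vector is rebuilt from the frequency output shown earlier (an assumption so the example runs on its own), and freq_table is a name made up for the example.

```r
# Rebuild the 52 math scores from the frequency output shown earlier
# (an assumption so the example runs without the Google Sheet).
scores <- c(19, 21, 22, 24, 25, 26, 28, 29, 31:45, 47:53, 60)
counts <- c(1, 1, 1, 1, 1, 1, 1, 1,
            2, 1, 5, 3, 1, 1, 4, 5, 1, 2, 1, 1, 3, 2, 2,
            1, 1, 2, 1, 1, 1, 2,
            1)
math <- rep(scores, times = counts)

# Levels 60 down to 19 keep the scores that nobody obtained (frequency 0)
score_levels <- factor(math, levels = 60:19)

# The argument name ScoreX becomes the name of the score column
freq_table <- as.data.frame(table(ScoreX = score_levels))
freq_table
```

One difference to keep in mind: here ScoreX is a factor, whereas the mutate() version stores it as character.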

Our new Table 2-2

Frequency Distribution of Scores on the Mathematics Test for 52 Students

table2.2
   ScoreX Freq
60     60    1
59     59    0
58     58    0
57     57    0
56     56    0
55     55    0
54     54    0
53     53    2
52     52    1
51     51    1
50     50    1
49     49    2
48     48    1
47     47    1
46     46    0
45     45    2
44     44    2
43     43    3
42     42    1
41     41    1
40     40    2
39     39    1
38     38    5
37     37    4
36     36    1
35     35    1
34     34    3
33     33    5
32     32    1
31     31    2
30     30    0
29     29    1
28     28    1
27     27    0
26     26    1
25     25    1
24     24    1
23     23    0
22     22    1
21     21    1
20     20    0
19     19    1

Grouping for clarity

Scores are often grouped into broader categories to further improve the clarity of presentation. We discard some detail in the data to make it easier to grasp the picture presented by the entire set of scores. In our example, we will group three adjacent scores, so that each grouping interval covers three score points. The observed range from 19 to 60 spans 42 score points, or 14 intervals of three. The breaks used in the code that follows start at 17 to line up with Thorndike, so the resulting tables show 15 intervals, from (17,20] to (59,62].
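
The interval arithmetic can be checked with a few lines of R. This is a small sketch, and the variable names (lowest, highest, width, and so on) are made up for the example; the breaks starting at 17 are the ones used in the code on the following slides.

```r
lowest  <- 19   # lowest observed math score
highest <- 60   # highest observed math score
width   <- 3    # number of score points per interval

# Number of score points from 19 to 60, inclusive
n_values <- highest - lowest + 1

# Intervals of width 3 needed to cover exactly that range
n_intervals <- ceiling(n_values / width)

# Breaks starting at 17 and ending at 62 define one more interval than that
breaks_from_17 <- seq(17, 62, by = width)
n_intervals_from_17 <- length(breaks_from_17) - 1

c(n_values = n_values,
  n_intervals = n_intervals,
  n_intervals_from_17 = n_intervals_from_17)
```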

Grouped Frequency Distribution

As shown on page 31 (one way to approach this)

# create a vector called "bins", counting by threes
bins <- seq(17, 62, by=3) 
# NOTE: Starting at 17 lines up with Thorndike

# Then create a vector called "Interval"
Interval <- cut(students$math, bins)
# cut() divides the range of students$math into the intervals defined by bins and codes each score according to the interval it falls in.
# Intervals are closed on the right, so (17,20] holds the scores 18, 19, and 20.

table(Interval) # Produces a horizontal table
Interval
(17,20] (20,23] (23,26] (26,29] (29,32] (32,35] (35,38] (38,41] (41,44] (44,47] 
      1       2       3       2       3       9      10       4       6       3 
(47,50] (50,53] (53,56] (56,59] (59,62] 
      4       4       0       0       1 
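
The default labels such as (17,20] can be hard to read on a slide. Because the scores are whole numbers, (17,20] contains exactly 18, 19, and 20, so one option is to pass cut() a labels argument. The sketch below assumes labels of the form "18-20", which is a choice made for this example and not necessarily the book's layout; math is rebuilt from the frequency output shown earlier so that the example runs on its own.

```r
# Rebuild the 52 math scores from the frequency output shown earlier
# (an assumption so the example runs without the Google Sheet).
scores <- c(19, 21, 22, 24, 25, 26, 28, 29, 31:45, 47:53, 60)
counts <- c(1, 1, 1, 1, 1, 1, 1, 1,
            2, 1, 5, 3, 1, 1, 4, 5, 1, 2, 1, 1, 3, 2, 2,
            1, 1, 2, 1, 1, 1, 2,
            1)
math <- rep(scores, times = counts)

bins <- seq(17, 62, by = 3)

# Hypothetical labels: lowest and highest whole-number score in each interval
interval_labels <- paste(head(bins, -1) + 1, tail(bins, -1), sep = "-")

# Same intervals as before, only the labels differ
Interval <- cut(math, breaks = bins, labels = interval_labels)
table(Interval)
```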

Table 2-3

Grouped Frequency Distribution of Scores from 52 Students on a Math Test Using an Interval of 3

# as.data.frame() turns the table into a vertical layout like Table 2-3
table.2.3 <- as.data.frame(table(Interval))
table.2.3
   Interval Freq
1   (17,20]    1
2   (20,23]    2
3   (23,26]    3
4   (26,29]    2
5   (29,32]    3
6   (32,35]    9
7   (35,38]   10
8   (38,41]    4
9   (41,44]    6
10  (44,47]    3
11  (47,50]    4
12  (50,53]    4
13  (53,56]    0
14  (56,59]    0
15  (59,62]    1

Table 2-3

This table with 60 at the top (descending order)

arrange(table.2.3, desc(Interval)) 
   Interval Freq
1   (59,62]    1
2   (56,59]    0
3   (53,56]    0
4   (50,53]    4
5   (47,50]    4
6   (44,47]    3
7   (41,44]    6
8   (38,41]    4
9   (35,38]   10
10  (32,35]    9
11  (29,32]    3
12  (26,29]    2
13  (23,26]    3
14  (20,23]    2
15  (17,20]    1

Cumulative Frequency Distribution

A cumulative frequency distribution is easily prepared from the frequency distribution or grouped frequency distribution, as shown in Table 2-4, which presents the cumulative frequency, as well as the frequency in each interval. Each entry in the column labeled “Cumulative Frequency” shows the total number of individuals having a score equal to or less than the highest score in that interval.

Create data frame

Note: There are more elegant ways to do this.

This code produces table.2.4, whose intervals run from (17,20] to (59,62].

# Add cumulative frequency and cumulative percent to the grouped table
table.2.4 <- data.frame(table.2.3) |>
  mutate(Cumulative_Frequency = cumsum(Freq)) |>
  mutate(Cumulative_Percent = round(100*cumsum(Freq)/52))

To display this table in descending order, see the code on the next slide.
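
The cumulative percent above divides by a hard-coded 52. A base-R sketch that takes the total from the data instead is shown here. It is an alternative, not the code used for the slides; math is rebuilt from the frequency output shown earlier, and grouped is a name made up for the example.

```r
# Rebuild the 52 math scores from the frequency output shown earlier
# (an assumption so the example runs without the Google Sheet).
scores <- c(19, 21, 22, 24, 25, 26, 28, 29, 31:45, 47:53, 60)
counts <- c(1, 1, 1, 1, 1, 1, 1, 1,
            2, 1, 5, 3, 1, 1, 4, 5, 1, 2, 1, 1, 3, 2, 2,
            1, 1, 2, 1, 1, 1, 2,
            1)
math <- rep(scores, times = counts)

# Grouped frequency table with intervals of 3
Interval <- cut(math, seq(17, 62, by = 3))
grouped <- as.data.frame(table(Interval))

# Running total of frequencies, from the lowest interval upward
grouped$Cumulative_Frequency <- cumsum(grouped$Freq)

# Percent of all students at or below the top of each interval;
# sum(grouped$Freq) replaces the hard-coded 52
grouped$Cumulative_Percent <-
  round(100 * grouped$Cumulative_Frequency / sum(grouped$Freq))

# Show the highest interval first
grouped[nrow(grouped):1, ]
```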

Table 2-4

arrange(table.2.4, desc(Interval))
   Interval Freq Cumulative_Frequency Cumulative_Percent
1   (59,62]    1                   52                100
2   (56,59]    0                   51                 98
3   (53,56]    0                   51                 98
4   (50,53]    4                   51                 98
5   (47,50]    4                   47                 90
6   (44,47]    3                   43                 83
7   (41,44]    6                   40                 77
8   (38,41]    4                   34                 65
9   (35,38]   10                   30                 58
10  (32,35]    9                   20                 38
11  (29,32]    3                   11                 21
12  (26,29]    2                    8                 15
13  (23,26]    3                    6                 12
14  (20,23]    2                    3                  6
15  (17,20]    1                    1                  2

Graphic representation of univariate data.

Univariate data is data that consists of only one variable. The data can be displayed in a variety of ways, including histograms, cumulative frequency curves, and step curves.

Histograms

ggplot(students, aes(x = math, y = after_stat(count))) +
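  # boundary = 17 puts the bin edges at 17, 20, 23, ..., matching the intervals in Table 2-3
  geom_histogram(binwidth = 3, boundary = 17, color = "black", fill = "grey") +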
  theme_classic() +
  labs(x = "Mathematics test scores",
       y = "Frequency") +
  scale_x_continuous(breaks = 17 + c(0:15)*3) +
  scale_y_continuous(breaks = 0 + c(0:6)*2) + 
  ggtitle("Figure 2-1\nHistogram of 52 mathematics scores")

Figure 2-1
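
The bar heights of a histogram can also be checked without drawing anything. As a rough cross-check, base R's hist() with plot = FALSE returns the counts for a given set of breaks; with the breaks from Table 2-3 the counts should match that table. The math vector is rebuilt from the frequency output shown earlier (an assumption so the example runs on its own), and h and bin_counts are names made up for the example.

```r
# Rebuild the 52 math scores from the frequency output shown earlier
# (an assumption so the example runs without the Google Sheet).
scores <- c(19, 21, 22, 24, 25, 26, 28, 29, 31:45, 47:53, 60)
counts <- c(1, 1, 1, 1, 1, 1, 1, 1,
            2, 1, 5, 3, 1, 1, 4, 5, 1, 2, 1, 1, 3, 2, 2,
            1, 1, 2, 1, 1, 1, 2,
            1)
math <- rep(scores, times = counts)

# Compute the histogram bins without plotting; intervals are closed on the right
h <- hist(math, breaks = seq(17, 62, by = 3), plot = FALSE)

# One row per bin: lower edge, upper edge, and number of scores in the bin
bin_counts <- data.frame(lower = head(h$breaks, -1),
                         upper = tail(h$breaks, -1),
                         count = h$counts)
bin_counts
```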

Cumulative Frequency Curve

# Again, this is one way of approaching it, though not an especially elegant one.
# Threes holds the upper limit of each interval (20, 23, ..., 62)
MathByThrees <- data.frame(table.2.4) |>
  mutate(Threes = 20 + c(0:14)*3)

# Figure 2-3
ggplot(MathByThrees, aes(x=Threes, y=Cumulative_Frequency)) +
  geom_line() +
  geom_point() +
  theme_classic() +
  labs(x = "Math Score",
       y = "Cumulative frequency") +
  ggtitle("Figure 2-3\nCumulative frequency curve") +
  scale_x_continuous(breaks = 18 + c(0:14)*3)

Figure 2-3

Step Curve

Maybe it doesn’t make sense to put the data into bins of 3 when we draw an ogive like Figure 2-3. Another way to approach this is to plot an empirical cumulative distribution function (ECDF), or step curve, which uses every individual score.

ggplot(students, aes(x = math)) +
  stat_ecdf(geom = "step") +
  theme_classic() +
  labs(x = "Mathematics test scores",
       y = "Cumulative proportion") +
  ggtitle("Figure 2-4\nStep curve of 52 mathematics scores")

Figure 2-4
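
To read values off a step curve numerically, base R provides ecdf(), which returns a function giving the proportion of scores at or below any value. A minimal sketch follows; math is rebuilt from the frequency output shown earlier (an assumption so the example runs on its own), and Fn is a name made up for the example.

```r
# Rebuild the 52 math scores from the frequency output shown earlier
# (an assumption so the example runs without the Google Sheet).
scores <- c(19, 21, 22, 24, 25, 26, 28, 29, 31:45, 47:53, 60)
counts <- c(1, 1, 1, 1, 1, 1, 1, 1,
            2, 1, 5, 3, 1, 1, 4, 5, 1, 2, 1, 1, 3, 2, 2,
            1, 1, 2, 1, 1, 1, 2,
            1)
math <- rep(scores, times = counts)

# Fn(x) is the proportion of scores less than or equal to x
Fn <- ecdf(math)

# Proportion of students scoring 38 or below
prop_38 <- Fn(38)

# Convert the proportion back to a cumulative frequency
freq_38 <- prop_38 * length(math)

c(proportion = prop_38, frequency = freq_38)
```

The cumulative frequency at 38 should agree with the Cumulative_Frequency entry for the (35,38] interval in Table 2-4, because 38 is the top of that interval.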

Cumulative Percent Curve

And one last detail: Let’s change the code so that we have a cumulative percentage on the y-axis.

# Figure 2-5
ggplot(students, aes(x = math)) +
  stat_ecdf(geom = "step") +
  # Produce the empirical cumulative distribution function
  scale_y_continuous(labels = scales::percent) +
  # Change the y-axis labels from proportions to percentages
  theme_classic() +
  labs(x = "Math Score",
       y = "Cumulative percentage") +
  ggtitle("Figure 2-5\nCumulative percentage (step curve)")

Figure 2-5