Class Slides on Error

Hoffman

Error

Credit to Professor Dan Koretz of Harvard University for many of these explanations.

One of the most important concepts in measurement and statistics is the notion of error. And it appears in many contexts:

  • the “margin of error” reported in polls
  • reports of “statistically significant” research findings
  • reports of test scores provided to K-12 students and their parents, where a student’s level of performance falls within a band surrounding their test score.

Error to consider

Chapter 7: Error and Reliability:
“How much we don’t know what we’re talking about”

Measurement Error?

Sampling Error?

Timing Hoffman’s laps

Hoffman swims laps at the college pool. He’s not very fast, but wants to get better. So he enlists some people to help him improve, and they have him swim some 50s to see how fast he is and what needs to be done.

  • Hoffman swims his fastest (after warming up)
  • the helpers time him using their phones
  • some of Hoffman’s laps are faster than others
  • small inconsistencies in starting and stopping the timers

Random or Systematic?

It is not always clear whether inconsistency is random or systematic.

  • the several timers might not agree on what counts as the start or the stop
  • timers might not have a clear view of when Hoffman touches the wall at the end
  • small differences in pushing the start/stop button

For now, focus on truly random error, knowing that apparently random inconsistencies can stem from systematic as well as random processes.

Timing laps

Three timers at the ready. Hoffman swims 50 yards

# Simulate three different timings of the same swim:
# integer hundredths of a second around a mean of 51.71 s (SD of 0.25 s)
Try1 <- 0.01 * as.integer(rnorm(3, 5171, 25))
  • Timer 1 recorded 52.2
  • Timer 2 recorded 51.4
  • Timer 3 recorded 51.77

The three timers averaged 51.79 with a standard deviation of 0.4
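
A quick check of those two numbers, as a minimal sketch using the three recorded times (not the simulated Try1 vector):

# Mean and SD of the three recorded timings of the same lap
times <- c(52.2, 51.4, 51.77)
mean(times)   # about 51.79
sd(times)     # about 0.40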

What’s the “true” time?

There is always some inconsistency in any measurement procedure. We don’t know precisely how close we are to the truth, but we assume that there is some true measurement.

Four more timed laps

Assume that Hoffman is all warmed up, and that he’s got enough rest between laps. He’s going to swim 50 yards five times, with three timers, and this is what we see:

  • Timer 1 recorded 52.2, 50.77, 52.5, 52.84, 52.85
  • Timer 2 recorded 51.4, 50.74, 52.55, 52.67, 52.91
  • Timer 3 recorded 51.77, 50.77, 52.89, 53.12, 53.27

Hoffman averaged 51.79, 50.76, 52.65, 52.88, 53.01
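
One way to reproduce those lap averages (a minimal sketch; the laps matrix is just the timings listed above):

# Timers in rows, laps in columns; average across timers for each lap
laps <- rbind(
  Timer1 = c(52.20, 50.77, 52.50, 52.84, 52.85),
  Timer2 = c(51.40, 50.74, 52.55, 52.67, 52.91),
  Timer3 = c(51.77, 50.77, 52.89, 53.12, 53.27)
)
round(colMeans(laps), 2)   # 51.79 50.76 52.65 52.88 53.01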

Classical Test Theory

Classical test theory assumes there’s a true score (call it T) obtained if there were no errors in measurement. Hoffman’s true lap time is the expected lap time over an infinite number of independent timings. (Imagine an infinite number of people watching, each with their stopwatch out.)

But we never observe Hoffman’s true lap time, only an observed time (let’s call it X). The assumption of classical test theory is that an observed time is equal to the true time plus some error:

\[ X = T + e \]

Error of measurement

This is simple algebra, but if an observed score is the combination of a true score plus error, then any difference between the observed time and the true time is an error of measurement (we’re still talking about that one lap that Hoffman swam):

\[ e = X - T \]

The timers’ errors might be too high or too low. Most of the errors will be pretty small, but sometimes the times can be way off – with large errors of measurement.
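
A tiny simulation of that idea (the true time and the error SD here are made up for illustration):

set.seed(7)
true_time <- 51.7                             # assumed true lap time
err       <- rnorm(1000, mean = 0, sd = 0.4)  # random timing errors
observed  <- true_time + err                  # X = T + e
summary(observed - true_time)                 # the errors: mostly small, a few large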

Sampling Error

Most familiar example: political polls – sampling voters prior to an election.

Also:

  • sampling students for NAEP
  • sampling schools for NAEP or PISA

Correlation and error

How is correlation related to error?

  • How strongly are two variables related to each other?
  • How consistent is performance on two tests?

Why should you care?

  • Many policy decisions require understanding the strength of relationships with scores.

  • Example: is the correlation of SAT scores with college performance strong enough to justify testing?

Correlations are essential for understanding reliability

Positive, negative, uncorrelated

Positive correlation: high scores on one variable are associated with high scores on a second variable.

  • an easy example: height and weight

Negative correlation: high scores on one variable are associated with low scores on a second variable

  • an easy example: Tobacco consumption and life expectancy

Uncorrelated: no systematic relationship between scores on the two variables

Two types of correlations

Pearson (the most common; we use this one for most illustrations in class)

  • 2 continuous variables
  • variables should be interval, but this assumption is sometimes relaxed

Point-biserial

  • One continuous variable, one dichotomous
  • example: correlation of height with gender
  • example: test score with right-wrong on an MC item
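
A point-biserial correlation in R is just cor() with a 0/1 variable. A minimal sketch with made-up data (score and item1 are hypothetical):

score <- c(64, 40, 74, 69, 51, 56, 71, 68, 67, 61)   # total test scores
item1 <- c( 1,  0,  1,  1,  0,  0,  1,  1,  1,  0)   # right (1) / wrong (0) on one MC item
cor(score, item1)   # the point-biserial correlation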

Uncorrelated variables:

library(ggplot2)
Jenny <- runif(100) # uniform random distribution between 0 & 1
Kelly <- runif(100)
Elena <- data.frame(Jenny,Kelly)
ggplot(Elena, aes(Jenny,Kelly)) + geom_point(shape = 1) + theme(aspect.ratio = 1)

Notice that aspect.ratio = 1 – important for the reader.
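
As a quick follow-up (using the same simulated Jenny and Kelly), the correlation itself should come out close to zero:

cor(Elena$Jenny, Elena$Kelly)   # near 0 for two independent uniform samples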

Back to page 53

In our Thorndike reading, there’s a section called “Measures of Relationship” starting on page 53. We’ve touched on this before, and we’ve looked at scatterplots of two variables (like our Table 2.1 students).

library(tidyverse)
library(readxl)
# READ IN DATA 
Table.2.1 <- read_csv(file = "Table.2.1_clean.csv")

Figure 2-9 (page 53)

Scatterplot of reading and mathematics scores

ggplot(Table.2.1, aes(x = Math, y = Reading)) + geom_point() + theme(aspect.ratio = 1)

Correlation coefficient

“As an index of degree of relationship, a statistic known as the correlation coefficient is widely used. The symbol r is used to designate this coefficient.” (Thorndike, p. 53)

A command in R: cor(xvariable, yvariable)

cor(Table.2.1$Math, Table.2.1$Reading) 
[1] 0.6222243

What is the relationship between math and reading in our sample of 52 students?

Answer: r = 0.62. Positive correlation: students with higher math scores also have higher reading scores.

Summary points about correlations

  • Correlations express the strength of a relationship (or association)
  • Correlations are standardized, so they are not affected by original scale
  • Correlations are bounded between -1 and +1
  • Even test scores on different subjects tend to show moderate to high correlations (math and reading for our Table 2.1 kids)
  • Scores on two similar Spelling tests are likely to be highly correlated

Correlation between two tests

What if students took two spelling tests with 80 words per test? In this new dataset, the original spelling test is Spelling1, and there’s a new column called Spelling2

TwoSpelling <- read_excel("TwoSpelling.xlsx")
TwoSpelling
# A tibble: 52 × 4
   `First Name` `Last Name` Spelling1 Spelling2
   <chr>        <chr>           <dbl>     <dbl>
 1 Aaron        Andrews            64        65
 2 Byron        Biggs              64        62
 3 Charles      Cowen              40        45
 4 Donna        Davis              74        73
 5 Erin         Edwards            69        70
 6 Fernando     Franco             67        60
 7 Gail         Galaraga           71        70
 8 Harpo        Henry              51        55
 9 Irrida       Ignacio            68        70
10 Jack         Johanson           56        50
# ℹ 42 more rows

Descriptive Statistics

Descriptive Statistics of two 80-item spelling tests

summary(TwoSpelling)
  First Name         Last Name           Spelling1       Spelling2    
 Length:52          Length:52          Min.   :38.00   Min.   :37.00  
 Class :character   Class :character   1st Qu.:51.00   1st Qu.:50.00  
 Mode  :character   Mode  :character   Median :57.00   Median :58.00  
                                       Mean   :57.15   Mean   :56.79  
                                       3rd Qu.:64.00   3rd Qu.:65.00  
                                       Max.   :76.00   Max.   :73.00  
sd(TwoSpelling$Spelling1)
[1] 9.350232
sd(TwoSpelling$Spelling2)
[1] 9.622964

Scatterplot of two spelling tests

ggplot(TwoSpelling, aes(x = Spelling1, y = Spelling2)) + 
  geom_point() + theme(aspect.ratio = 1)

Between the two tests

  • The relationship between the two tests is positive
  • The relationship between the two tests is linear
  • The correlation coefficient between the two tests:
cor(TwoSpelling$Spelling1, TwoSpelling$Spelling2) 
[1] 0.8574511

Consistency

  • How consistent is performance on two tests?

  • Correlation coefficients are one way to quantify this relationship.

  • Correlations are essential for understanding reliability.

  • The full name of this relationship is the Pearson product-moment correlation.

Reliability

The concept of “reliability” in measurement is an expansion of the statistical concept of error.

  • In layperson’s terms, error might refer to bias.
  • In statistics, error might refer to systematic errors over many measurements.
  • Or, error might refer to imprecision or uncertainty.

Imprecision and uncertainty are the focus here.

The CTT Assumptions:

  • Each person has a true score – T – that we would see if there were no errors in measurement.
  • Statistical definition of T: The expected score over an infinite number of independent administrations of a test.
  • We can’t observe a person’s true score, only an observed score, X.
  • That observed score is equal to the true score plus some error. (The error can be positive or negative.)

CTT Key Concepts:

  • A true score is the expected value (long-run average) of a person’s scores over replications.
  • Error is the difference between the observed and true scores in any single replication.
  • Standard error of measurement represents the average standard deviation of errors.
  • Reliability coefficient describes the expected correlation (again, long-run average) of a person’s scores between replications.

What is true?

# A tibble: 52 × 4
   `First Name` `Last Name` Spelling1 Spelling2
   <chr>        <chr>           <dbl>     <dbl>
 1 Aaron        Andrews            64        65
 2 Byron        Biggs              64        62
 3 Charles      Cowen              40        45
 4 Donna        Davis              74        73
 5 Erin         Edwards            69        70
 6 Fernando     Franco             67        60
 7 Gail         Galaraga           71        70
 8 Harpo        Henry              51        55
 9 Irrida       Ignacio            68        70
10 Jack         Johanson           56        50
# ℹ 42 more rows

After test 1, our best estimate of Aaron’s true spelling score was 64.

After test 2 and getting a 65, our best estimate of Aaron’s true score is now 64.5

Generally, and a bit stat-y

The true score is the expected value of observed scores over replications:

\[ T = E(X) \]

  • E(X), the expected value over replications, should not be confused with the error term, e. (This is understandably confusing.)

More stat-y

The expected value of errors is 0:

\[ E(e) = 0 \]

The covariance between true scores and errors is zero (no relationship):

\[ Cov(T, e) = 0 \]
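
A toy simulation of those two assumptions (all the numbers here are made up for illustration):

set.seed(42)
true_score <- rnorm(10000, mean = 57, sd = 9)    # true scores across people
err        <- rnorm(10000, mean = 0,  sd = 3.6)  # errors, generated independently of T
mean(err)               # close to 0
cov(true_score, err)    # close to 0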

Writing the CTT equation in variances

\[ X = T + e \]

What you observe (X) is a combination of a “true” score (T) plus some error of measurement (e).

And assuming the errors are independent and random, we can use the equation to show that the variance in observed scores is made up of variance in the true scores (true differences between people) plus the variance in the errors of measurement.

\[ SD^2_{x} = SD^2_{T} + SD^2_{e} \]

or

\[Variance(X) = Variance(T) + Variance(e)\]
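
A toy check of that decomposition by simulation (again, the numbers are made up for illustration):

set.seed(42)
true_score <- rnorm(10000, 57, 9)      # made-up true scores
err        <- rnorm(10000, 0, 3.6)     # made-up independent errors
X          <- true_score + err         # observed scores
var(X)                                 # roughly equal to...
var(true_score) + var(err)             # ...true-score variance plus error variance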

Variance?

  • Page 48 of Thorndike: “The variance is defined as the mean of the squared deviations from the mean…”
  • Variance of Spelling scores for Table 2-1 kids:
var(Table.2.1$Spelling)
[1] 87.42685

NOTE: Variance is a number that is NOT in the scale of the spelling test.

Standard Deviation

Standard deviation we DO know! It’s a statistic that IS on the scale of the test, and it’s the square root of the variance

sqrt(var(Table.2.1$Spelling))
[1] 9.350232
sd(Table.2.1$Spelling)
[1] 9.350232

So then we can imagine that about two thirds of the kids scored between 9 points below and 9 points above the mean (assuming the scores are roughly normally distributed).
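
A quick way to check that rule of thumb on the actual spelling scores (assuming Table.2.1 is loaded as above):

# Proportion of kids within one SD of the mean (about 2/3 if roughly normal)
mean(abs(Table.2.1$Spelling - mean(Table.2.1$Spelling)) <= sd(Table.2.1$Spelling))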

OK, back to the two spelling tests

Why did I show you the correlation between the two tests?

This correlation, when it’s two measures of the same trait like spelling, is the reliability coefficient.

And the reliability coefficient is an estimate of the proportion of variance in the observed scores that is due to true differences between students.

\[ r_{tt} = \frac{SD^2_T}{SD^2_X} \]

Using the data we have

We can use the data that we have to estimate the reliability of either of our spelling tests:

First: we’ve already calculated the correlation coefficient

cor(TwoSpelling$Spelling1, TwoSpelling$Spelling2) 
[1] 0.8574511

Standard deviation of scores

Do we know the standard deviation of the scores? In fact, we have two estimates of the SD of spelling:

sd(TwoSpelling$Spelling1) 
[1] 9.350232
sd(TwoSpelling$Spelling2) 
[1] 9.622964
AvgSpellSD <- (sd(TwoSpelling$Spelling1) + sd(TwoSpelling$Spelling2))/2 
AvgSpellSD
[1] 9.486598

So we’ll use 9.49 as our estimate
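
Before turning to the SEM, note that the reliability formula above also lets us back out the implied true-score SD, since \( SD^2_T = r_{tt} \times SD^2_X \). A sketch using the numbers just computed:

SD_T <- AvgSpellSD * sqrt(cor(TwoSpelling$Spelling1, TwoSpelling$Spelling2))
SD_T   # implied true-score SD, about 8.8 words, a bit below the observed SD of 9.49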

What is the Standard Error of Measurement?

The formula for the standard error of measurement is:

\[ SD_{e} = SD_{x} \sqrt{(1 - r_{tt})} \]

SEM <- AvgSpellSD*sqrt(1 - cor(TwoSpelling$Spelling1, TwoSpelling$Spelling2))
SEM
[1] 3.581727

Standard Error of Measurement

In our example, the standard error of measurement for our two spelling tests is 3.58, and it’s measured in the same scale.

In this example, our SEM is in the scale of “words spelled correctly”.

Does SEM = 3.58 pass the sniff test?

Our standard error of measurement (our SEM of 3.58) is measured on the same scale as the test: items answered correctly on the 80-word spelling tests.

Does this seem reasonable?

If a kid took spelling test #1 and then took spelling test #2, would they score within 3 or 4 points of their first score?

Let’s look at the differences between tests

Here, I add a column (using mutate) calculating the difference between the two spelling test scores

TwoSpelling <- TwoSpelling |>
  mutate(Difference = Spelling2 - Spelling1)
TwoSpelling
# A tibble: 52 × 5
   `First Name` `Last Name` Spelling1 Spelling2 Difference
   <chr>        <chr>           <dbl>     <dbl>      <dbl>
 1 Aaron        Andrews            64        65          1
 2 Byron        Biggs              64        62         -2
 3 Charles      Cowen              40        45          5
 4 Donna        Davis              74        73         -1
 5 Erin         Edwards            69        70          1
 6 Fernando     Franco             67        60         -7
 7 Gail         Galaraga           71        70         -1
 8 Harpo        Henry              51        55          4
 9 Irrida       Ignacio            68        70          2
10 Jack         Johanson           56        50         -6
# ℹ 42 more rows

Sort by size of difference

print(n = 52, arrange(TwoSpelling, Difference))
# A tibble: 52 × 5
   `First Name` `Last Name`  Spelling1 Spelling2 Difference
   <chr>        <chr>            <dbl>     <dbl>      <dbl>
 1 Salim        Salik               76        63        -13
 2 Zebulon      Zibberits           73        61        -12
 3 Moe          Mastrioni           58        49         -9
 4 Laverne      Lappenski           57        49         -8
 5 Nathan       Natts               47        39         -8
 6 Victor       Vaszquez            68        60         -8
 7 Fernando     Franco              67        60         -7
 8 Jack         Johanson            56        50         -6
 9 Jill         Johanson            61        55         -6
10 Petula       Peters              64        60         -4
11 Nancy        Nowits              44        40         -4
12 William      Westerbeke          54        50         -4
13 Quincy       Quirn               48        45         -3
14 Thelma       Thwaites            43        40         -3
15 Byron        Biggs               64        62         -2
16 Kleven       Klipsch             51        49         -2
17 Donna        Davis               74        73         -1
18 Gail         Galaraga            71        70         -1
19 Bellinda     Brown               38        37         -1
20 Dominik      Dubrow              66        65         -1
21 Igor         Ivanovich           53        52         -1
22 Orden        Orford              53        52         -1
23 Sally        Stebbens            51        50         -1
24 Thomas       Tank                65        65          0
25 Xenum        Xerxes              54        54          0
26 Charlotta    Cowen               47        47          0
27 Erik         Eriksen             55        55          0
28 Petre        Popovich            52        52          0
29 Rhonda       Rostropovich        50        50          0
30 Aaron        Andrews             64        65          1
31 Erin         Edwards             69        70          1
32 Usaka        Urban               65        66          1
33 Yuan         Young               59        60          1
34 Francis      French              59        60          1
35 Kaleen       Knowles             55        56          1
36 Velma        Vauter              49        50          1
37 Irrida       Ignacio             68        70          2
38 Mary         Madison             68        70          2
39 Hillary      Huan                61        63          2
40 Larry        Lewis               40        42          2
41 Zephina      Zoro                47        50          3
42 Harpo        Henry               51        55          4
43 Angela       Ash                 64        68          4
44 Charles      Cowen               40        45          5
45 Quadra       Quickly             44        49          5
46 Xena         Xerxes              57        62          5
47 Yannita      Younts              63        68          5
48 Rahim        Roberts             64        70          6
49 Wakana       Watanabe            53        60          7
50 Uriah        Urdahl              61        68          7
51 Guido        Garcia              52        60          8
52 Oprah        Oates               59        72         13

Plausible to us?

  • I sorted the differences from lowest to highest. The difference in Salim’s scores was pretty big, as they did way worse the second time.

  • But looking down the list, all the kids from Thelma to Zephina scored within 3 points of their first score.

  • That’s 29 of the 52 kids within 3. And 34 of the 52 kids scored within 4 points of their other test score.

I vote “pass” on the sniff test of my estimate of the standard error of measurement.
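
A quick check of those counts (using the Difference column added above):

sum(abs(TwoSpelling$Difference) <= 3)   # 29 of the 52 kids
sum(abs(TwoSpelling$Difference) <= 4)   # 34 of the 52 kids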

Is SEM = 3.58 normal?

ggplot(TwoSpelling, aes(Difference)) +
  geom_histogram(color = 'black', fill = 'green', bins = 22) +
  geom_vline(xintercept = -3.58) +
  geom_vline(xintercept = 3.58)

How about a density plot?

ggplot(TwoSpelling, aes(Difference)) +
  geom_density() +
  geom_vline(xintercept = -3.58) +
  geom_vline(xintercept = 3.58)