Chapter 7: Error and Reliability
“How much we don’t know what we’re talking about”
Credit to Professor Dan Koretz of Harvard University for many of these explanations.
One of the most important concepts in measurement and statistics is the notion of error. And it appears in many contexts:
Measurement Error?
Sampling Error?
Hoffman swims laps at the college pool. He’s not very fast, but wants to get better. So he enlists some people to help him improve, and they have him swim some 50s to see how fast he is and what needs to be done.
It’s not always clear whether inconsistency is random or systematic.
For now, focus on truly random error, knowing that apparently random inconsistencies can stem from systematic as well as random processes.
Three timers at the ready. Hoffman swims 50 yards
The three timers averaged 51.79 with a standard deviation of 0.4
There is always some inconsistency in any measurement procedure. We don’t know precisely how close we are to the truth, but we assume that there is some true measurement.
Assume that Hoffman is all warmed up, and that he’s got enough rest between laps. He’s going to swim 50 yards five times, with three timers, and this is what we see:
Hoffman’s five laps (each time averaged across the three timers): 51.79, 50.76, 52.65, 52.88, 53.01
Classical test theory assumes there’s a true score (call it T) obtained if there were no errors in measurement. Hoffman’s true lap time is the expected lap time over an infinite number of independent timings. (Imagine an infinite number of people watching, each with their stopwatch out.)
But we never observe Hoffman’s true lap time, only an observed time (let’s call it X). The assumption of classical test theory is that an observed time is equal to the true time plus some error:
\[ X = T + e \]
This is simple algebra, but if an observed score is the combination of a true score plus error, then any difference between the observed time and the true time is an error of measurement. (We’re still talking about that one lap that Hoffman swam.)
\[ e = X - T \]
The timers’ errors might be too high or too low. Most of the errors will be pretty small, but sometimes the times can be way off – with large errors of measurement.
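To make the model concrete, here is a minimal R simulation of X = T + e. The true lap time (52 seconds) and the error SD (0.4) are made-up values for illustration:

# Simulate classical test theory: observed time = true time + random error
set.seed(1)
true_time <- 52                              # assumed true lap time, T
errors <- rnorm(10000, mean = 0, sd = 0.4)   # random timer errors, e
observed <- true_time + errors               # observed times, X = T + e
mean(observed)   # approaches the true time as timings pile up
mean(errors)     # approaches 0, since E(e) = 0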
The most familiar example of sampling error: political polls – sampling voters prior to an election.
Also:
How is correlation related to error?
How strongly are two variables related to each other?
How consistent is performance on two tests?
Why should you care?
Many policy decisions require understanding the strength of relationships between test scores and other variables.
Example: is the correlation of SAT scores with college performance strong enough to justify testing?
Correlations are essential for understanding reliability
Positive correlation: high scores on one variable are associated with high scores on a second variable.
Negative correlation: high scores on one variable are associated with low scores on a second variable
Uncorrelated: no systematic relationship between scores on the two variables
Two common correlation coefficients:
Pearson (the most common; we used this one for most illustrations in class)
Point-biserial
Notice that aspect.ratio = 1 in these scatterplots – a square plot keeps the relationship from looking distorted to the reader.
In our Thorndike reading, there’s a section called “Measures of Relationship” starting on page 53. We’ve talked about this before, and we’ve looked at scatterplots of two variables (like our Table 2.1 students).
Scatterplot of reading and mathematics scores
“As an index of degree of relationship, a statistic known as the correlation coefficient is widely used. The symbol r is used to designate this coefficient.” (Thorndike, p. 53)
A command in R: cor(xvariable, yvariable)
What is the relationship between math and reading in our sample of 52 students?
Answer: r = 0.62. Positive correlation: students with higher math scores also tend to have higher reading scores.
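In R, that might look like this (assuming the scores live in a data frame called students, with columns Math and Reading – those names are my guesses):

cor(students$Math, students$Reading)   # 0.62 in our sample of 52 students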
What if students took two spelling tests with 80 words per test? In this new dataset, the original spelling test is Spelling1, and there’s a new column called Spelling2
# A tibble: 52 × 4
`First Name` `Last Name` Spelling1 Spelling2
<chr> <chr> <dbl> <dbl>
1 Aaron Andrews 64 65
2 Byron Biggs 64 62
3 Charles Cowen 40 45
4 Donna Davis 74 73
5 Erin Edwards 69 70
6 Fernando Franco 67 60
7 Gail Galaraga 71 70
8 Harpo Henry 51 55
9 Irrida Ignacio 68 70
10 Jack Johanson 56 50
# ℹ 42 more rows
Descriptive Statistics of two 80-item spelling tests
First Name Last Name Spelling1 Spelling2
Length:52 Length:52 Min. :38.00 Min. :37.00
Class :character Class :character 1st Qu.:51.00 1st Qu.:50.00
Mode :character Mode :character Median :57.00 Median :58.00
Mean :57.15 Mean :56.79
3rd Qu.:64.00 3rd Qu.:65.00
Max. :76.00 Max. :73.00
SD of Spelling1:
[1] 9.350232
SD of Spelling2:
[1] 9.622964
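The output above could be produced with something like this (assuming the data sit in a tibble called spelling – the name is an assumption):

summary(spelling)        # descriptive statistics for every column
sd(spelling$Spelling1)   # 9.350232
sd(spelling$Spelling2)   # 9.622964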
How consistent is performance on two tests?
Correlation coefficients are one way to quantify this relationship.
Correlations are essential for understanding reliability.
The full name of this statistic is the Pearson product-moment correlation coefficient.
The concept of “reliability” in measurement is an expansion of the statistical concept of error.
Imprecision and uncertainty are the focus here.
Here’s our spelling data again – the same tibble of 52 students shown above.
After test 1, our best estimate of Aaron’s true spelling score was 64.
After test 2 and getting a 65, our best estimate of Aaron’s true score is now 64.5
The true score is the expected value of observed scores over replications: \[ T = E(X) \]
E(X), the expected value over replications, should not be confused with the error term, e. (This is understandably confusing.)
The expected value of errors is 0: \[ E(e) = 0 \]
The covariance between true scores and errors is zero (no relationship): \[ Cov(T, e) = 0 \]
\[ X = T + e \]
What you observe (X) is a combination of a “true” score (T) plus some error of measurement (e)
And assuming the errors are independent and random, we can use the equation to show that the variance in observed scores is made up of variance in the true scores (true differences between people) plus the variance in the errors of measurement.
\[ SD^2_{x} = SD^2_{T} + SD^2_{e} \]
or
\[Variance(X) = Variance(T) + Variance(e)\]
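Here’s a quick simulation sketch of that decomposition, with made-up SDs for the true scores and the errors:

# If T and e are independent, Var(X) = Var(T) + Var(e)
set.seed(2)
true_scores <- rnorm(100000, mean = 57, sd = 8)   # assumed true-score distribution
errs <- rnorm(100000, mean = 0, sd = 4)           # assumed errors, independent of T
observed <- true_scores + errs
var(observed)                    # about 80 (that is, 8^2 + 4^2)
var(true_scores) + var(errs)     # about 80 – the two match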
NOTE: Variance is a number that is NOT in the scale of the spelling test (its units are squared score points).
Standard deviation we DO know! It’s a statistic that IS on the scale of the test, and it’s the square root of the variance.
So, with an SD of about 9.5 words, we can imagine that about two-thirds of the kids scored between roughly 9 points above the mean and 9 points below the mean.
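That two-thirds figure is just the within-one-SD rule for a roughly normal distribution – you can check it in R:

pnorm(1) - pnorm(-1)   # proportion within 1 SD of the mean: about 0.68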
Why did I show you the correlation between the two tests?
This correlation, when it’s between two measures of the same trait (like spelling), is the reliability coefficient.
And the reliability coefficient is an estimate of the proportion of variance in the observed scores that is due to true differences between students.
\[ r_{tt} = \frac{SD^2_T}{SD^2_X} \]
We can use the data that we have to estimate the reliability of either of our spelling tests:
First: we’ve already calculated the correlation coefficient between the two spelling tests (that’s our reliability estimate)
Do we know the standard deviation of the scores? In fact, we have two estimates of the SD of spelling:
sd(Spelling1):
[1] 9.350232
sd(Spelling2):
[1] 9.622964
Averaging the two estimates:
[1] 9.486598
So we’ll use 9.49 (the average of the two SDs) as our estimate.
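In R, those pieces might look like this (same assumed spelling tibble as before):

r_tt <- cor(spelling$Spelling1, spelling$Spelling2)              # reliability estimate
sd_x <- mean(c(sd(spelling$Spelling1), sd(spelling$Spelling2)))  # about 9.49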
The formula for the standard error of measurement is:
\[ SD_{e} = SD_{x} \sqrt{(1 - r_{tt})} \]
In our example, the standard error of measurement for our two spelling tests is 3.58, and it’s on the same scale as the test itself: words spelled correctly on the 80-word spelling tests.
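Continuing the sketch from above, the SEM is one line of R:

sem <- sd_x * sqrt(1 - r_tt)   # about 3.58 words spelled correctly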
Does this seem reasonable?
If a kid took spelling test #1 and then took spelling test #2, would they score within 3 or 4 points of their first score?
Here, I add a column (using mutate) calculating the difference between the two spelling test scores
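A sketch of that step (same assumed spelling tibble, with dplyr loaded):

library(dplyr)
spelling <- spelling %>%
  mutate(Difference = Spelling2 - Spelling1)   # positive = did better on test 2
spelling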
# A tibble: 52 × 5
`First Name` `Last Name` Spelling1 Spelling2 Difference
<chr> <chr> <dbl> <dbl> <dbl>
1 Aaron Andrews 64 65 1
2 Byron Biggs 64 62 -2
3 Charles Cowen 40 45 5
4 Donna Davis 74 73 -1
5 Erin Edwards 69 70 1
6 Fernando Franco 67 60 -7
7 Gail Galaraga 71 70 -1
8 Harpo Henry 51 55 4
9 Irrida Ignacio 68 70 2
10 Jack Johanson 56 50 -6
# ℹ 42 more rows
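Next, to see the whole spread of differences, I sorted the table by the new column – something like this, continuing the same sketch:

spelling %>%
  arrange(Difference) %>%   # lowest (biggest drop) to highest (biggest gain)
  print(n = 52)             # show all 52 rows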
# A tibble: 52 × 5
`First Name` `Last Name` Spelling1 Spelling2 Difference
<chr> <chr> <dbl> <dbl> <dbl>
1 Salim Salik 76 63 -13
2 Zebulon Zibberits 73 61 -12
3 Moe Mastrioni 58 49 -9
4 Laverne Lappenski 57 49 -8
5 Nathan Natts 47 39 -8
6 Victor Vaszquez 68 60 -8
7 Fernando Franco 67 60 -7
8 Jack Johanson 56 50 -6
9 Jill Johanson 61 55 -6
10 Petula Peters 64 60 -4
11 Nancy Nowits 44 40 -4
12 William Westerbeke 54 50 -4
13 Quincy Quirn 48 45 -3
14 Thelma Thwaites 43 40 -3
15 Byron Biggs 64 62 -2
16 Kleven Klipsch 51 49 -2
17 Donna Davis 74 73 -1
18 Gail Galaraga 71 70 -1
19 Bellinda Brown 38 37 -1
20 Dominik Dubrow 66 65 -1
21 Igor Ivanovich 53 52 -1
22 Orden Orford 53 52 -1
23 Sally Stebbens 51 50 -1
24 Thomas Tank 65 65 0
25 Xenum Xerxes 54 54 0
26 Charlotta Cowen 47 47 0
27 Erik Eriksen 55 55 0
28 Petre Popovich 52 52 0
29 Rhonda Rostropovich 50 50 0
30 Aaron Andrews 64 65 1
31 Erin Edwards 69 70 1
32 Usaka Urban 65 66 1
33 Yuan Young 59 60 1
34 Francis French 59 60 1
35 Kaleen Knowles 55 56 1
36 Velma Vauter 49 50 1
37 Irrida Ignacio 68 70 2
38 Mary Madison 68 70 2
39 Hillary Huan 61 63 2
40 Larry Lewis 40 42 2
41 Zephina Zoro 47 50 3
42 Harpo Henry 51 55 4
43 Angela Ash 64 68 4
44 Charles Cowen 40 45 5
45 Quadra Quickly 44 49 5
46 Xena Xerxes 57 62 5
47 Yannita Younts 63 68 5
48 Rahim Roberts 64 70 6
49 Wakana Watanabe 53 60 7
50 Uriah Urdahl 61 68 7
51 Guido Garcia 52 60 8
52 Oprah Oates 59 72 13
I sorted by the difference, from lowest to highest. The difference in Salim’s scores was pretty big, as they did way worse the second time.
But if you look and see, all the kids from Quincy down through Zephina scored within 3 points of their first score.
That’s 29 of the 52 kids within 3 points. And 34 of the 52 kids scored within 4 points of their other test score.
I vote “pass” on the sniff test of my estimate of the standard error of measurement.
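Those counts are easy to verify in R (same assumed tibble). One hedge worth noting: the difference between two observed scores carries two errors of measurement, so its standard error is sqrt(2) times the SEM – a bit over 5 points – which is why the within-5 count comes closest to the two-thirds rule:

sum(abs(spelling$Difference) <= 3)   # 29 of 52 kids
sum(abs(spelling$Difference) <= 4)   # 34 of 52 kids
sqrt(2) * 3.58                       # about 5.06: SE of a difference of two scores
sum(abs(spelling$Difference) <= 5)   # 38 of 52 kids (73%), close to two-thirds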