Investigating Inter-IFP Reliability

For year 3, we have 144 IFPs, and we want to examine their inter-item reliability as measures of overall forecaster ability.

We use individuals’ scores on all IFPs that they forecasted and calculate Cronbach’s alpha, including the alpha obtained when each question is removed. This requires pairwise deletion because most forecasters do not participate in all questions. We also examine item-total correlations and look for questions that may be adversely affecting reliability.
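As a rough illustration of alpha under pairwise deletion, the sketch below builds each variance and covariance from only the forecasters who have scores for the items involved, then applies the usual covariance form of alpha. The function names and the toy scores are hypothetical, and this is not necessarily the exact procedure used for the figures in this report.

```python
from itertools import combinations


def pairwise_cov(x, y):
    """Sample covariance over forecasters who scored both items (None = missing)."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two forecasters in common")
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in pairs) / (n - 1)


def cronbach_alpha(items):
    """Raw alpha from a pairwise-deletion covariance matrix.

    items: one list per IFP, each holding one score per forecaster.
    """
    k = len(items)
    variances = sum(pairwise_cov(col, col) for col in items)
    covariances = 2 * sum(
        pairwise_cov(items[i], items[j]) for i, j in combinations(range(k), 2)
    )
    return (k / (k - 1)) * (1 - variances / (variances + covariances))


# Toy data (hypothetical): the fourth forecaster skipped the first IFP.
toy_items = [
    [1, 2, 3, None],
    [1, 2, 3, 4],
]
print(round(cronbach_alpha(toy_items), 3))
```

Because each cell of the covariance matrix can rest on a different subset of forecasters, the resulting matrix is only an approximation of what complete data would give, which is worth keeping in mind when reading the alphas below.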

Distribution of Scores

The distribution of individual scores vs. overall scores for all surveys logit on year 3 so far.

[Plot: distribution of scores]

We can see most scores are quite low, and there is a long right-hand tail for less-than-ideal outcomes. This skew leads us to consider a logit transformation in our reliability calculations. We will compare the results from both calculations.
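The report does not say how scores were rescaled before the logit was taken, so the following is only a sketch under stated assumptions: scores lie between 0 and a maximum of 2 (the usual Brier range), and values are clamped slightly away from the endpoints so the logarithm stays finite. The name `logit_score` and the parameters `max_score` and `eps` are illustrative, not taken from the analysis.

```python
import math


def logit_score(score, max_score=2.0, eps=1e-6):
    """Map a score in [0, max_score] onto the real line with a logit.

    max_score and eps are assumptions made for this sketch.
    """
    p = score / max_score
    p = min(max(p, eps), 1 - eps)  # keep p strictly inside (0, 1)
    return math.log(p / (1 - p))


# Low scores that sit close together are spread apart after the transform.
for raw in (0.02, 0.05, 0.10, 0.50, 1.00):
    print(raw, round(logit_score(raw), 2))
```

The transform is monotonic, so it preserves the ordering of forecasters on any single question while reducing the influence of the long right-hand tail.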

Analysis of Reliability

If we use untransformed scores, we obtain a Cronbach’s alpha of 0.954, or 0.968 standardized (based on inter-item correlations).

This compares to an alpha of 0.989, or 0.987 standardized, if we use logit-transformed scores.
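Standardized alpha can be written in terms of the number of items k and the average inter-item correlation. As a quick sanity check, the sketch below plugs in k = 144 and an average correlation of 0.18, the rounded average_r reported for untransformed scores in the reliability table. Because that input is rounded, the result only approximately matches the reported 0.968.

```python
def standardized_alpha(k, average_r):
    """Standardized alpha from item count and average inter-item correlation."""
    return k * average_r / (1 + (k - 1) * average_r)


alpha = standardized_alpha(144, 0.18)
signal_to_noise = alpha / (1 - alpha)  # same quantity as k * r / (1 - r)

print(round(alpha, 3))            # about 0.969
print(round(signal_to_noise, 1))  # about 31.6
```

With this many items, even a modest average correlation of 0.18 produces a very high alpha, so a high alpha here says more about the number of IFPs than about how strongly any two of them agree.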

Reliability

Reliability calculations reveal 22 of 144 IFPs whose removal would improve reliability, even marginally: 1318-2, 1347-2, 1159-0, 1220-0, 1315-2, 1310-0, 1300-6, 1340-0, 1350-0, 1234-2, 1369-2, 1357-0, 1277-2, 1243-6, 1359-6, 1319-0, 1256-6, 1396-6, 1269-1, 1275-2, 1268-6, 1321-0. Calculations with logit-transformed scores identify 30 such IFPs.

[Plot: reliability]
Reliability
    ifp_id  raw_alpha  std.alpha  G6(smc)  average_r    S/N  alpha se  diff
 1  1318-2       0.96       0.97     1.00       0.18  31.72      0.00  0.00
 2  1347-2       0.96       0.97     1.00       0.18  31.57      0.00  0.00
 3  1159-0       0.96       0.97     1.00       0.18  31.39      0.00  0.00
 4  1220-0       0.96       0.97     1.00       0.18  31.44      0.00  0.00
 5  1315-2       0.96       0.97     1.00       0.18  31.34      0.00  0.00
 6  1310-0       0.96       0.97     1.00       0.18  31.29      0.00  0.00
 7  1300-6       0.96       0.97     1.00       0.18  31.13      0.00  0.00
 8  1340-0       0.96       0.97     1.00       0.18  31.07      0.00  0.00
 9  1350-0       0.96       0.97     1.00       0.18  31.02      0.00  0.00
10  1234-2       0.95       0.97     1.00       0.18  30.99      0.00  0.00
11  1369-2       0.95       0.97     1.00       0.18  31.05      0.00  0.00
12  1357-0       0.95       0.97     1.00       0.18  31.02      0.00  0.00
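The idea behind this table can be sketched as follows: recompute alpha with each IFP left out in turn and flag any IFP whose removal raises alpha. The sketch assumes complete data for brevity (the report itself relies on pairwise deletion), assumes that a diff column of this kind is alpha-if-dropped minus overall alpha, and uses made-up item IDs and scores.

```python
from statistics import variance


def cronbach_alpha(items):
    """Raw alpha for complete data; items is one list of scores per IFP."""
    k = len(items)
    totals = [sum(row) for row in zip(*items)]
    item_var = sum(variance(col) for col in items)
    return (k / (k - 1)) * (1 - item_var / variance(totals))


def alpha_change_if_dropped(items, ids):
    """Change in alpha when each item is removed (positive = reliability improves)."""
    overall = cronbach_alpha(items)
    changes = {}
    for i, ifp_id in enumerate(ids):
        remaining = items[:i] + items[i + 1:]
        changes[ifp_id] = cronbach_alpha(remaining) - overall
    return changes


# Hypothetical toy data: items A and B agree, item C runs against them.
ids = ["A", "B", "C"]
items = [
    [1, 2, 3, 4],
    [1, 2, 3, 4],
    [4, 1, 3, 2],
]
changes = alpha_change_if_dropped(items, ids)
flagged = [ifp_id for ifp_id in ids if changes[ifp_id] > 0]
print(flagged)
```

Note that "improves alpha if removed" is a very permissive criterion: with 144 items, the changes are tiny (0.00 at two decimals in the table above), so being flagged is weak evidence on its own.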

Item-Total Correlations and Item Statistics

Among untransformed scores, we see there are 15 of 144 IFPs with correlations between -0.2 and 0.2: 1340-0, 1256-6, 1369-2, 1350-0, 1357-0, 1234-2, 1243-6, 1268-6, 1359-6, 1277-2, 1396-6, 1319-0, 1321-0, 1269-1, 1275-2. With logit scores, we observe 12 such IFPs. Remember that these correlations use pairwise deletion due to substantial amounts of missing data.

[Plot: item-total correlations]
Item Statistics
    ifp_id       n      r  r.cor  r.drop  mean    sd
 1  1300-6  124.00  -0.23  -0.23   -0.25  0.59  0.36
 2  1340-0  419.00  -0.17  -0.17   -0.22  0.37  0.36
 3  1256-6  556.00  -0.16  -0.16   -0.17  0.32  0.14
 4  1369-2  282.00  -0.16  -0.16   -0.19  0.48  0.31
 5  1350-0  255.00  -0.14  -0.14   -0.15  0.49  0.39
 6  1357-0  204.00  -0.13  -0.13   -0.10  1.06  0.34
 7  1234-2  421.00  -0.11  -0.11   -0.13  0.64  0.37
 8  1243-6  395.00  -0.06  -0.06   -0.07  0.83  0.24
 9  1268-6  422.00  -0.02  -0.02    0.00  0.36  0.18
10  1359-6  307.00  -0.01  -0.01   -0.00  0.89  0.27
11  1277-2  272.00  -0.00  -0.00   -0.02  0.43  0.32
12  1396-6  228.00   0.08   0.08    0.06  0.57  0.29
13  1319-0  151.00   0.10   0.10    0.09  0.50  0.35
14  1321-0  362.00   0.16   0.16    0.15  0.66  0.26
15  1269-1  359.00   0.16   0.16    0.16  0.38  0.37
16  1275-2  320.00   0.17   0.17    0.13  0.45  0.32
17  1388-6  239.00   0.22   0.22    0.20  0.23  0.17
18  1341-0  307.00   0.23   0.23    0.22  0.32  0.31
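To make the r and r.drop columns concrete, the sketch below computes two item-total correlations for each item: one against the total of all items (which includes the item itself) and one against the total of the remaining items. It assumes complete data and uses hypothetical toy scores; the r.cor column, which applies a further correction, is not reproduced here.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def item_total_correlations(items):
    """Return (r, r_drop) per item: correlation with the full total and with
    the total of all other items."""
    totals = [sum(row) for row in zip(*items)]
    results = []
    for col in items:
        rest = [t - v for t, v in zip(totals, col)]
        results.append((pearson(col, totals), pearson(col, rest)))
    return results


# Hypothetical toy data: the third item disagrees with the first two.
items = [
    [1, 2, 3, 4],
    [1, 2, 3, 4],
    [4, 1, 3, 2],
]
stats = item_total_correlations(items)
weak = [i for i, (r, _) in enumerate(stats) if -0.2 < r < 0.2]
print(weak)
```

In the toy data the third item has a small positive r only because it contributes to its own total; once it is removed from the total, the correlation turns negative. With 144 IFPs the item's own contribution is far smaller, which is consistent with r and r.drop being close in the table above.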

Which IFPs are potentially unreliable?

If we consider IFPs identified with either transformed or untransformed scores, there are 32 potentially unreliable IFPs: 1318-2, 1347-2, 1159-0, 1220-0, 1315-2, 1310-0, 1300-6, 1340-0, 1350-0, 1234-2, 1369-2, 1357-0, 1277-2, 1243-6, 1359-6, 1319-0, 1256-6, 1396-6, 1269-1, 1275-2, 1268-6, 1321-0, 1216-0, 1341-0, 1267-0, 1308-6, 1261-6, 1251-0, 1311-6, 1388-6, 1222-6, 1304-2.

[Plot: potentially unreliable IFPs]

What does it mean?

Looking at the reliability measures overall, we see that alpha and item-total correlation are highly interrelated with Brier scores. This underscores both the importance and the difficulty of isolating questions that are truly unreliable and not simply difficult.

In the plots below:

  • Blue points are considered reliable by both methods
  • Green points are unreliable with logit-transformed scores
  • Red points are unreliable in both calculations

Using Untransformed Scores: [Plot: reliability measures vs. Brier scores]

With Logit-Transformed Scores: [Plot: reliability measures vs. Brier scores]

*(All p-values are well below .001)

What Should We Do?

It seems unnecessarily risky to disregard all of our worst-scoring IFPs when ranking participants. In doing so, we could assign too much weight to, or reward people for, overly aggressive strategies. As I mentioned, Brier scores are intended to magnify the mistakes inherent in taking risks, and to ignore them is to manipulate the scoring method itself.
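The asymmetry can be seen in a small worked example. The sketch assumes the original multi-category form of the Brier score (squared error summed over all answer options, ranging from 0 to 2) on a hypothetical two-option question; the scoring rule actually applied to each IFP may differ, for instance for ordered answer options.

```python
def brier_score(forecast, outcome_index):
    """Sum of squared errors across all answer options (0 = perfect, 2 = worst)."""
    return sum(
        (p - (1.0 if i == outcome_index else 0.0)) ** 2
        for i, p in enumerate(forecast)
    )


cautious = [0.6, 0.4]
aggressive = [0.9, 0.1]

# Option 0 occurs: both forecasters leaned the right way.
gain = brier_score(cautious, 0) - brier_score(aggressive, 0)   # about 0.30
# Option 1 occurs: both forecasters leaned the wrong way.
loss = brier_score(aggressive, 1) - brier_score(cautious, 1)   # about 0.90

print(round(gain, 2), round(loss, 2))
```

Here the aggressive forecaster gains about 0.30 over the cautious one when right but loses about 0.90 when wrong. Dropping the IFPs with the worst scores would remove mostly the second kind of outcome, which is how overly aggressive strategies could end up rewarded.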