For year 3, we have 144 IFPs, and we want to assess their inter-item reliability as measures of overall forecaster ability.
We use individuals' scores on all IFPs that they forecasted, and calculate Cronbach's alpha across questions. This requires pairwise deletion because most forecasters do not participate in all questions. We also examine item-total correlations, and look for questions that may be affecting reliability adversely.
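As a rough illustration of what pairwise deletion means here, the sketch below computes raw alpha from a covariance matrix in which each entry uses only the forecasters who answered both questions. The function names and the toy scores are hypothetical, and this is not necessarily how the figures in this report were produced.

```python
from itertools import combinations

def pairwise_cov(x, y):
    """Sample covariance over rows where both x and y are observed (not None)."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two jointly observed rows")
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    return sum((a - mx) * (b - my) for a, b in pairs) / (n - 1)

def cronbach_alpha(items):
    """Raw alpha from a pairwise-complete covariance matrix.

    items: list of columns, one per question; None marks a missing score.
    """
    k = len(items)
    variances = [pairwise_cov(col, col) for col in items]
    covariances = [pairwise_cov(items[i], items[j])
                   for i, j in combinations(range(k), 2)]
    total = sum(variances) + 2 * sum(covariances)
    return k / (k - 1) * (1 - sum(variances) / total)

# Hypothetical toy data: 3 questions, 6 forecasters, some scores missing.
scores = [
    [0.1, 0.4, None, 0.8, 0.3, 0.6],
    [0.3, 0.2, 0.6, 0.9, None, 0.4],
    [0.2, None, 0.5, 0.6, 0.4, 0.1],
]
alpha = cronbach_alpha(scores)
```

Because each covariance is estimated on a different subset of forecasters, the resulting matrix is not guaranteed to be internally consistent, which is one reason to read the alphas in this report with some caution.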
The distribution of individual scores vs. overall scores for all surveys logit on year 3 so far.
We can see most scores are quite low, and there is a long right-hand tail for less-than-ideal outcomes. This skew leads us to consider a logit transformation in our reliability calculations. We will compare the results from both calculations.
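One way such a transformation could be applied is sketched below. The report does not state the exact form used, so the rescaling by a maximum score of 2 (the two-sided Brier scale) and the small `eps` clamp are assumptions made for illustration only.

```python
import math

def logit_score(score, max_score=2.0, eps=1e-6):
    """Map a score in [0, max_score] to the real line via logit(score / max_score).

    max_score=2.0 is an assumption (two-sided Brier scale); eps keeps scores
    of exactly 0 or max_score away from -inf / +inf.
    """
    p = score / max_score
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))

# Hypothetical scores with a long right-hand tail.
raw = [0.02, 0.10, 0.36, 1.06, 1.90]
transformed = [logit_score(s) for s in raw]
```

The transformation is monotone, so it preserves the ranking of scores while spreading out the many low values and pulling in the tail.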
If we use untransformed scores, we obtain a Cronbach's alpha of 0.954, or 0.968 standardized (that is, based on the inter-item correlations rather than the covariances).
This compares to an alpha of 0.989, or 0.987 standardized, if we use logit-transformed scores.
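As a sanity check on the untransformed figure, standardized alpha can be approximated from the number of items and the mean inter-item correlation. The sketch below plugs in the rounded `average_r` of 0.18 that appears in the item-dropped table, so it is only an approximation of the reported value.

```python
def standardized_alpha(k, mean_r):
    """Standardized alpha from item count k and mean inter-item correlation."""
    return k * mean_r / (1 + (k - 1) * mean_r)

k = 144          # number of IFPs in year 3
mean_r = 0.18    # rounded average_r taken from the item-dropped table (assumption)
approx = standardized_alpha(k, mean_r)   # about 0.969
```

This lands close to the reported 0.968, and it shows why alpha is so high here: with 144 items, even a modest average correlation produces a standardized alpha near 1.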
Reliability calculations reveal 22 of 144 IFPs whose removal would improve reliability, even marginally: 1318-2, 1347-2, 1159-0, 1220-0, 1315-2, 1310-0, 1300-6, 1340-0, 1350-0, 1234-2, 1369-2, 1357-0, 1277-2, 1243-6, 1359-6, 1319-0, 1256-6, 1396-6, 1269-1, 1275-2, 1268-6, 1321-0. The same calculation with logit-transformed scores identifies 30.
| | ifp_id | raw_alpha | std.alpha | G6(smc) | average_r | S/N | alpha se | diff |
|---|---|---|---|---|---|---|---|---|
| 1 | 1318-2 | 0.96 | 0.97 | 1.00 | 0.18 | 31.72 | 0.00 | 0.00 |
| 2 | 1347-2 | 0.96 | 0.97 | 1.00 | 0.18 | 31.57 | 0.00 | 0.00 |
| 3 | 1159-0 | 0.96 | 0.97 | 1.00 | 0.18 | 31.39 | 0.00 | 0.00 |
| 4 | 1220-0 | 0.96 | 0.97 | 1.00 | 0.18 | 31.44 | 0.00 | 0.00 |
| 5 | 1315-2 | 0.96 | 0.97 | 1.00 | 0.18 | 31.34 | 0.00 | 0.00 |
| 6 | 1310-0 | 0.96 | 0.97 | 1.00 | 0.18 | 31.29 | 0.00 | 0.00 |
| 7 | 1300-6 | 0.96 | 0.97 | 1.00 | 0.18 | 31.13 | 0.00 | 0.00 |
| 8 | 1340-0 | 0.96 | 0.97 | 1.00 | 0.18 | 31.07 | 0.00 | 0.00 |
| 9 | 1350-0 | 0.96 | 0.97 | 1.00 | 0.18 | 31.02 | 0.00 | 0.00 |
| 10 | 1234-2 | 0.95 | 0.97 | 1.00 | 0.18 | 30.99 | 0.00 | 0.00 |
| 11 | 1369-2 | 0.95 | 0.97 | 1.00 | 0.18 | 31.05 | 0.00 | 0.00 |
| 12 | 1357-0 | 0.95 | 0.97 | 1.00 | 0.18 | 31.02 | 0.00 | 0.00 |
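The "alpha if item dropped" logic behind a table like this can be sketched as follows. For brevity the example assumes complete data (no pairwise deletion) and uses hypothetical question names and scores; `diff` here is the dropped-item alpha minus the full alpha, which we assume matches the meaning of the `diff` column.

```python
from statistics import variance

def alpha_complete(items):
    """Raw Cronbach's alpha for complete data (no missing scores)."""
    k = len(items)
    totals = [sum(row) for row in zip(*items)]
    item_var = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - item_var / variance(totals))

def alpha_if_dropped(items, names):
    """Return {name: (alpha without that item, change versus full alpha)}."""
    full = alpha_complete(items)
    out = {}
    for i, name in enumerate(names):
        rest = items[:i] + items[i + 1:]
        a = alpha_complete(rest)
        out[name] = (a, a - full)
    return out

# Hypothetical toy data: q4 runs against the other three questions.
names = ["q1", "q2", "q3", "q4"]
items = [
    [0.1, 0.3, 0.5, 0.7, 0.9],
    [0.2, 0.3, 0.6, 0.6, 0.8],
    [0.1, 0.4, 0.4, 0.8, 0.9],
    [0.9, 0.5, 0.6, 0.2, 0.3],
]
result = alpha_if_dropped(items, names)
# Flag questions whose removal raises alpha, however slightly.
flagged = [n for n, (_, diff) in result.items() if diff > 0]
```

In this toy set only `q4` is flagged. With 144 items, the change from dropping any single question is necessarily tiny, which is consistent with the `diff` values rounding to 0.00 above.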
Using untransformed scores, 15 of 144 IFPs have item-total correlations between -0.2 and 0.2: 1340-0, 1256-6, 1369-2, 1350-0, 1357-0, 1234-2, 1243-6, 1268-6, 1359-6, 1277-2, 1396-6, 1319-0, 1321-0, 1269-1, 1275-2. With logit scores, we observe 12 such IFPs. Remember that these correlations use pairwise deletion due to substantial amounts of missing data.
| | ifp_id | n | r | r.cor | r.drop | mean | sd |
|---|---|---|---|---|---|---|---|
| 1 | 1300-6 | 124.00 | -0.23 | -0.23 | -0.25 | 0.59 | 0.36 |
| 2 | 1340-0 | 419.00 | -0.17 | -0.17 | -0.22 | 0.37 | 0.36 |
| 3 | 1256-6 | 556.00 | -0.16 | -0.16 | -0.17 | 0.32 | 0.14 |
| 4 | 1369-2 | 282.00 | -0.16 | -0.16 | -0.19 | 0.48 | 0.31 |
| 5 | 1350-0 | 255.00 | -0.14 | -0.14 | -0.15 | 0.49 | 0.39 |
| 6 | 1357-0 | 204.00 | -0.13 | -0.13 | -0.10 | 1.06 | 0.34 |
| 7 | 1234-2 | 421.00 | -0.11 | -0.11 | -0.13 | 0.64 | 0.37 |
| 8 | 1243-6 | 395.00 | -0.06 | -0.06 | -0.07 | 0.83 | 0.24 |
| 9 | 1268-6 | 422.00 | -0.02 | -0.02 | 0.00 | 0.36 | 0.18 |
| 10 | 1359-6 | 307.00 | -0.01 | -0.01 | -0.00 | 0.89 | 0.27 |
| 11 | 1277-2 | 272.00 | -0.00 | -0.00 | -0.02 | 0.43 | 0.32 |
| 12 | 1396-6 | 228.00 | 0.08 | 0.08 | 0.06 | 0.57 | 0.29 |
| 13 | 1319-0 | 151.00 | 0.10 | 0.10 | 0.09 | 0.50 | 0.35 |
| 14 | 1321-0 | 362.00 | 0.16 | 0.16 | 0.15 | 0.66 | 0.26 |
| 15 | 1269-1 | 359.00 | 0.16 | 0.16 | 0.16 | 0.38 | 0.37 |
| 16 | 1275-2 | 320.00 | 0.17 | 0.17 | 0.13 | 0.45 | 0.32 |
| 17 | 1388-6 | 239.00 | 0.22 | 0.22 | 0.20 | 0.23 | 0.17 |
| 18 | 1341-0 | 307.00 | 0.23 | 0.23 | 0.22 | 0.32 | 0.31 |
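A minimal sketch of an item-rest correlation under missing data is given below, roughly in the spirit of the `r.drop` column. It assumes that each forecaster's "rest" score is the mean of their other answered questions, which may differ from the exact computation behind the table; the forecaster and question labels are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def item_rest_correlations(scores):
    """scores: {forecaster: {ifp_id: score}}; unanswered IFPs are simply absent.

    For each IFP, correlate its scores with the mean of each forecaster's
    scores on their other answered IFPs. Returns {ifp_id: (n, r)}.
    """
    ifps = sorted({q for answered in scores.values() for q in answered})
    out = {}
    for q in ifps:
        xs, ys = [], []
        for answered in scores.values():
            rest = [s for other, s in answered.items() if other != q]
            if q in answered and rest:
                xs.append(answered[q])
                ys.append(sum(rest) / len(rest))
        out[q] = (len(xs), pearson(xs, ys))
    return out

# Hypothetical toy data: question C tends to move against A and B.
scores = {
    "f1": {"A": 0.1, "B": 0.2, "C": 0.9},
    "f2": {"A": 0.3, "B": 0.4, "C": 0.5},
    "f3": {"A": 0.6, "B": 0.5},
    "f4": {"A": 0.8, "B": 0.9, "C": 0.2},
    "f5": {"B": 0.3, "C": 0.6},
}
table = item_rest_correlations(scores)
# Same band as in the text: correlations between -0.2 and 0.2.
weak = [q for q, (n, r) in table.items() if -0.2 <= r <= 0.2]
```

Note that `n` differs by question, as in the table above, so correlations for sparsely answered IFPs rest on far fewer forecasters than the rest.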
If we consider IFPs identified with either transformed or untransformed scores, there are 32 potentially unreliable IFPs: 1318-2, 1347-2, 1159-0, 1220-0, 1315-2, 1310-0, 1300-6, 1340-0, 1350-0, 1234-2, 1369-2, 1357-0, 1277-2, 1243-6, 1359-6, 1319-0, 1256-6, 1396-6, 1269-1, 1275-2, 1268-6, 1321-0, 1216-0, 1341-0, 1267-0, 1308-6, 1261-6, 1251-0, 1311-6, 1388-6, 1222-6, 1304-2.
Looking at the reliability measures overall, we see that alpha and item-total correlation are highly interrelated with Brier scores. This underscores both the importance and the difficulty of isolating questions that are truly unreliable and not simply difficult.
In the plots below:
Using Untransformed Scores:
With Logit Transformed Scores: *(All p-values are well below .001)*
It seems unnecessarily risky to disregard all of our worst-scoring IFPs when ranking participants. In so doing, we could be assigning too much weight to, or rewarding people for, overly aggressive strategies. As I mentioned, Brier scores are intended to magnify the mistakes inherent in taking risks, and to ignore them is to manipulate the scoring method itself.