This file contains a log of the progress on the faces project. Each log-entry contains:
Overview: An overview of what the markdown contains
This update contains links to the baseline regressions from before, with an added section for definitions, as well as a link to two different configurations of table 01.
p_hat_cnn. The columns then include the two other major predictors: predicted risk and charge characteristics. The 95% C.I. is included (though arguably not in the best format). These were obtained via bootstrapping.
Link:
Baseline Regression Summary: https://rpubs.com/JonasKnecht/regression_summary
Robustness Summary: https://rpubs.com/JonasKnecht/robustness_summary
Table 01 configurations: https://rpubs.com/JonasKnecht/table01
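As a rough illustration of how the bootstrapped 95% C.I. mentioned in the overview above can be obtained, here is a minimal sketch of a percentile bootstrap for a single regression slope. It is not the code behind Table 01: the helper names (ols_slope, bootstrap_ci), the bivariate setup, and the synthetic data are all assumptions made for the example.

```python
import random


def ols_slope(x, y):
    """Slope of the least-squares line of y on x (single predictor)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def bootstrap_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the OLS slope."""
    rng = random.Random(seed)
    n = len(x)
    slopes = []
    while len(slopes) < n_boot:
        # Resample rows (pairs of x and y) with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        if max(xs) == min(xs):  # degenerate resample, slope undefined
            continue
        ys = [y[i] for i in idx]
        slopes.append(ols_slope(xs, ys))
    slopes.sort()
    lo = slopes[int((alpha / 2) * n_boot)]
    hi = slopes[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Synthetic demo data (NOT project data): outcome = 0.4 * p_hat + noise.
demo_rng = random.Random(1)
p_hat = [demo_rng.random() for _ in range(300)]
release = [0.4 * p + demo_rng.gauss(0, 0.1) for p in p_hat]

point = ols_slope(p_hat, release)
lo, hi = bootstrap_ci(p_hat, release)
print(f"slope = {point:.3f}, 95% C.I. = [{lo:.3f}, {hi:.3f}]")
```

The same resampling loop would apply to a multi-covariate regression by refitting the full model on each resample and collecting the coefficient of interest.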
Next Steps
p_hat_cnn which will be good to compare to the results above
Overview: An overview of what the markdown contains
With these two markdowns I have split the output for our baseline regressions and our robustness analysis. This should hopefully make it easier to piece together what is important. The important headings are also in red. Here is an outline of their contents:
Link:
Baseline Regression Summary: https://rpubs.com/JonasKnecht/regression_summary
Robustness Summary: https://rpubs.com/JonasKnecht/robustness_summary
Next Steps
p_hat_cnn which will be good to compare to the results above
Overview: An overview of what the markdown contains
p_hat_cnn vs. the old model's predictions. Since increased spread should help with the linearity in the decile plots, I present a comparison of those as well.
Link:
Summary
Prediction Distribution & Deciles: Here we can clearly see a massive improvement over the older model, with much more weight being given to the LHS of the original p_hat_cnn distribution. This would indicate that the new minority_class_sampler function works well. Notice that this new CNN was trained with the scaling parameter (which I coded to be the target ratio of majority:minority) set to 2, down from the original 3.6 observed in the raw data (i.e. for every person actually jailed, there are approx. 3.6 who are not). The distribution plot is the most striking evidence that the new data-loader worked as intended (Eureka!!). The decile plots tell a similar story, in that the new model is strikingly linear and, as Decile Plot A - NEW MODEL indicates, the gap between successive deciles is much reduced.
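To make the role of the scaling parameter concrete, the sketch below shows one way a target majority:minority ratio can be turned into per-sample sampling weights. It is not the project's minority_class_sampler; the function name sampling_weights, the label coding (1 = minority), and the synthetic labels are assumptions for illustration only.

```python
import random
from collections import Counter


def sampling_weights(labels, target_ratio, minority_label=1):
    """Per-sample weights so that weighted draws (with replacement) yield an
    expected majority:minority ratio of target_ratio."""
    n_minor = sum(1 for lab in labels if lab == minority_label)
    n_major = len(labels) - n_minor
    w_minor = 1.0 / n_minor          # minority class gets total weight 1
    w_major = target_ratio / n_major  # majority class gets total weight target_ratio
    return [w_minor if lab == minority_label else w_major for lab in labels]


# Synthetic labels with the 3.6:1 raw ratio described above (NOT project data).
labels = [0] * 3600 + [1] * 1000
weights = sampling_weights(labels, target_ratio=2.0)

rng = random.Random(0)
drawn = rng.choices(labels, weights=weights, k=30000)
counts = Counter(drawn)
ratio = counts[0] / counts[1]
print(f"majority:minority in resampled data = {ratio:.2f}")
```

If the data loader is PyTorch-based (an assumption, the log does not say), weights of this kind are what torch.utils.data.WeightedRandomSampler expects as its weights argument.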
Baseline Regressions: Repeating our baseline regressions with these new CNN predictions doesn’t change the significance of our effect. We do however see a decrease in the coefficient value on p_hat_cnn from 0.403 (in our last baseline) to 0.380. Interestingly, the gap in adj-R-squared actually increases in the new model from 0.112 to 0.123. The same holds true for the female/male split, with the female coefficient still being greater than the male coefficient. Both are reduced w.r.t. our last baseline. All in all, with the adjustment for the spread of p_hat_cnn we seem to have reduced our model coefficient across the board (given that the linear model is now a much better fit this is not surprising). Note: This new data, and the reduction in regression coefficients, is entirely in line with what we observed in the control for non-linearity!
Non-linearity check: Repeating the non-linearity regressions we notice a much smaller coefficient reduction than seen with our previous CNN data. This links back to the summary above.
Next Steps
p_hat_cnn which will be good to compare to the results above
Overview: An overview of what the markdown contains
18 skin-tones, age, MTurk labels, p_hat_cov, and p_hat_cnn.
Link:
Summary
We notice no changes in the overall regression significance of our baseline regression. The 4 MTurk labels are still all insignificant. The coefficient on p_hat_cnn decreases from 0.415 to 0.403 and remains very significant. Splitting by gender reveals that for the female subsample the dominance label becomes significant at the 5 and 10% level. The male subsample is entirely insignificant.
The MTurk label regressions are fully insignificant on the entire male+female sample. Neither the individual labels nor the combined labels are significant at all. On the female subsample we see that all 4 labels are significant. (This is pretty big news!!!) All four labels display positive coefficients and attractiveness is significant at the 1% level. Dominance and trustworthiness are both significant at the 5% level and competence at the 10% level. The male subsample has no significant coefficients. Splitting by race shows no significant coefficients for the black subsample. For non-black individuals we see that trustworthiness is slightly significant at the 10% level.
Next Steps
p_hat_cnn which may be a good indication of what we can achieve with a re-written dataloader
Overview: An overview of what the markdown contains
We run the following regression on the final arrest outcome:
- Individual skin-tone effects
- Skin-tone divided into three categories (dark, medium, light)
- MTurk race label
Each of these includes:
- Control for gender (male/female split)
- Coefficient plots for each regression
Link:
Summary
-0.058 change in the arrest outcome for light females over dark females
Next Steps
attractivenss which literature suggests should play a role
p_hat_cnn which may be a good indication of what we can achieve with a re-written dataloader
Overview: An overview of what the markdown contains
- p_hat_cnn
- p_hat_cnn_decile_average
Link: Links to two separate markdowns for
Bivariate MTurk Labels: https://rpubs.com/JonasKnecht/bivariate_mturk_regressions
Non-linearity in p_hat_cnn: https://rpubs.com/JonasKnecht/non_linearity_updated
Note: I kept these separate because one is a simple sanity check and the other potentially has more next steps, and I want to cut down on the length of these output files to make them a little more comprehensible.
Summary
Combined male+female regressions still yield no significant results for any individual MTurk label or the combined model on release. The same holds for the male subset. For females we see that attractiveness, dominance, and trustworthiness are significant on their own, but not when combined. This is in-line with what we observed in the previous log. All effects are insignificant for the black vs. non-black subsets. However, controlling for race, the female subset effects become more significant with similar coefficient magnitudes.
Repeating these regressions for the well-labeled subset results in no significant effects for females. This may well be due to the lack of data, as we are restricting to 101 observations for the female subset. However, as this had been significant in the previous log, with the inclusion of all other regression covariates, I am a little puzzled by this! The combined male+female model sees significant effects for competence and trustworthiness on the well-labeled subset. We notice the same significance with the black subset on the well-labeled set.
All in all these regressions are a bit puzzling, but seem to suggest that there is some signal stemming from these labels.
To make sure our skin-tone labels make sense I plot their respective relationship to our race indicator. All seems to be in order here with darker skin tones receiving higher proportions of the black label.
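The sanity check described above amounts to computing, for each skin-tone level, the share of individuals carrying the black label and confirming that the share rises with darker tones. A minimal sketch follows; the function names, the integer coding (higher value = darker tone), and the toy data are assumptions for illustration, not the project's variables.

```python
from collections import defaultdict


def share_by_tone(skin_tone, is_black):
    """Proportion of the black label (0/1) within each skin-tone level."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for tone, flag in zip(skin_tone, is_black):
        totals[tone] += 1
        hits[tone] += flag
    return {tone: hits[tone] / totals[tone] for tone in sorted(totals)}


def is_non_decreasing(shares):
    """True if the share never falls as the tone code increases."""
    values = list(shares.values())
    return all(a <= b for a, b in zip(values, values[1:]))


# Toy data (NOT project data): tone 1 = lightest, 3 = darkest.
tone = [1, 1, 2, 2, 3, 3]
black = [0, 0, 0, 1, 1, 1]
shares = share_by_tone(tone, black)
print(shares, is_non_decreasing(shares))
```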
The main point is that I found my mistake and re-running the decile-average regression now yields significant results with a coefficient similar to the one seen previously. I also went ahead and ran a regression with higher order terms, none of which appear significant.
Next Steps
Overview: An overview of what the markdown contains
This analysis focused on extending the primary regression of risk, skin_tone, age, MTurk labels, covariates and p_hat_cnn on release outcome. Below is an overview of the different elements:
- risk_pred_prob distribution
- p_hat_cnn:
- p_hat_cnn by the per-decile mean
- p_hat_cnn by the per-decile mean and collapsing decile 1-3
- p_hat_cnn by a decile indicator (1-10)
- 499 observations
- N = 348) and female (N = 101)
Link: https://rpubs.com/JonasKnecht/updated_regressions
Summary: A summary of the conclusions drawn from the analysis
We primarily note that p_hat_cnn remains significant at all levels. Interestingly, the magnitude of the coefficient increases with the decile of risk_pred_prob we control for. This would indicate that at higher underlying risk, as indicated by our xg_boost model, the CNN is able to account for more variance in the judge's decision. The coefficient on p_hat_cnn increases from 0.317 at deciles 1-3, to 0.367 at deciles 4-6, to 0.514 at deciles 7-10. We also note similar increases in explanatory power when looking at the relative changes in adjusted R-squared.
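The comparison across risk deciles described above can be sketched as fitting the same regression separately within each band of risk deciles. The version below is a simplification: it uses a bivariate slope rather than the full covariate set, and the function names and toy data (with made-up slopes of 1, 2, and 3) are assumptions for the example, not project results.

```python
def ols_slope(x, y):
    """Slope of the least-squares line of y on x (single predictor)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def slope_by_band(risk_decile, p_hat, outcome, bands):
    """Fit a separate slope of outcome on p_hat inside each band of risk deciles."""
    result = {}
    for lo, hi in bands:
        rows = [(p, o) for d, p, o in zip(risk_decile, p_hat, outcome) if lo <= d <= hi]
        xs = [p for p, _ in rows]
        ys = [o for _, o in rows]
        result[(lo, hi)] = ols_slope(xs, ys)
    return result


# Toy data (NOT project data): the true slope is 1, 2, 3 in the three bands.
risk_decile, p_hat, outcome = [], [], []
for d in range(1, 11):
    true_slope = 1.0 if d <= 3 else 2.0 if d <= 6 else 3.0
    for p in (0.0, 0.5, 1.0):
        risk_decile.append(d)
        p_hat.append(p)
        outcome.append(true_slope * p)

bands = [(1, 3), (4, 6), (7, 10)]
slopes = slope_by_band(risk_decile, p_hat, outcome, bands)
print(slopes)
```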
p_hat_cnn
Here we were prompted to account for the non-linearity in p_hat_cnn as indicated by the decile plots (top of the markdown). Including the p_hat_decile_average suggests no significant effect, which is likely due to an error in the code and will be looked at further. Similarly, collapsing the bottom three deciles yields no results, and the inclusion of simple 1-10 values also remains insignificant. Looking into this further is going to be a next step!
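For reference, the decile-based transformations discussed above (per-decile mean, per-decile mean with the bottom three deciles collapsed, and a 1-10 indicator) can be sketched as follows. This is one plausible construction, not the project's code: the rank-based decile assignment and the function names are assumptions.

```python
def decile_index(values):
    """Assign each value a decile indicator from 1 to 10 by rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    deciles = [0] * n
    for rank, i in enumerate(order):
        deciles[i] = rank * 10 // n + 1
    return deciles


def decile_average(values, collapse_bottom=1):
    """Replace each value by the mean of its decile group.

    collapse_bottom=3 first merges deciles 1-3 into a single group.
    """
    groups = [max(d, collapse_bottom) for d in decile_index(values)]
    totals, counts = {}, {}
    for g, v in zip(groups, values):
        totals[g] = totals.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    return [totals[g] / counts[g] for g in groups]


# Toy predictions (NOT project data): 0.00, 0.01, ..., 0.99.
preds = [i / 100 for i in range(100)]
indicator = decile_index(preds)
per_decile = decile_average(preds)
collapsed = decile_average(preds, collapse_bottom=3)
print(per_decile[0], collapsed[0], indicator[-1])
```

Each of the three resulting columns would then replace p_hat_cnn in the regression in turn.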
Repeating our main regression on a subset of individuals with higher-accuracy MTurk labels (i.e. those with more workers-per-image) yields no significant results for any of our labels on a combined male + female data-set. The labels in question are: attractiveness, competence, dominance, and trustworthiness. These regressions include our MTurk labels together with our other arrest covariates, skin-tone, and p_hat_cnn. We also notice that p_hat_cnn is now insignificant, which is most likely due to the reduction in data.
Controlling for gender yields no significant results for males. For the female population, which includes only 101 observations, we notice that attractiveness and trustworthiness are now significant at the 5% and 10% level respectively. This was intended as a proof of concept on whether it would be worth increasing the number of labelers per image, i.e. whether we are likely to see any significant signal coming from these features. The female subset suggests that this is indeed the case and we go forward with future MTurk surveys, increasing the number of workers per image from 3 to 6.
Skin tone seems to have no direct relation to arrest-rate. Attractiveness, trustworthiness, and competence are highly positively correlated, with lighter skin tones receiving higher scores on all three labels. Dominance is highly negatively correlated with skin-tone, with lighter skin receiving lower scores. This is in line with other findings in the literature.
Next-Steps: A summary of what we are investigating next
p_hat_cnn and make sure there are no coding mistakes