Faces Project Log

This file logs progress on the faces project. Each log entry contains a date, topic, overview, link(s), summary, and next steps.



Date: Thursday, January 7, 2020

Topic: Bivariate Regression of MTurk Features and Non-Linearity in p_hat_cnn

Overview: An overview of what the markdown contains

  1. A set of bivariate regressions checking that our MTurk features correlate with arrest-outcome as suggested by the literature
  • Combined male+female regression of attractiveness, competence, dominance, and trustworthiness
    • Each label is regressed individually and once combined
  • Subsample by gender
  • Subsample by ethnicity (black vs. non-black)
  • Subsample on higher-detail labels
  • (In Appendix) Subsample of females with race control (as this is the only one significant)
  • (In Appendix) Sanity check for skin-tone relationship to race label
  2. Checking non-linearity in p_hat_cnn
  • Fixing coding error with p_hat_cnn_decile_average
  • Raw coding integers 1-10 for decile
  • Including higher order terms

Link: Links to two separate markdowns:

  1. Bivariate MTurk Labels: https://rpubs.com/JonasKnecht/bivariate_mturk_regressions

  2. Non-linearity in p_hat_cnn: https://rpubs.com/JonasKnecht/non_linearity_updated

Note: I kept these separate because one is a simple sanity check while the other potentially has more next steps, and I want to cut down on the length of these output files to make them more comprehensible.

Summary

  1. Bivariate Regressions (See Link 1):

Combined male+female regressions still yield no significant results on release for any individual MTurk label or for the combined model. The same holds for the male subset. For females, attractiveness, dominance, and trustworthiness are each significant on their own, but not when combined. This is in line with what we observed in the previous log. All effects are insignificant for the black vs. non-black subsets. However, controlling for race, the female subset effects become more significant with similar coefficient magnitudes.

Repeating these regressions for the well-labeled subset results in no significant effects for females. This may well be due to the lack of data, as we are restricting to 101 observations for the female subset. However, as this had been significant in the previous log with the inclusion of all other regression covariates, I am a little puzzled by this! The combined male+female model sees significant effects for competence and trustworthiness on the well-labeled subset. We notice the same significance for the black subset on the well-labeled set.

All in all these regressions are a bit puzzling, but seem to suggest that there is some signal stemming from these labels.
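As a rough illustration of what one of these bivariate regressions computes, here is a minimal sketch of an ordinary least squares fit of a 0/1 release flag on a single label. It is an assumption that a linear probability model is the right stand-in (the actual markdown may use a logistic model instead), and the function name and toy data are hypothetical, not taken from the project.

```python
import math

def bivariate_ols(x, y):
    """Fit y = a + b*x by ordinary least squares; return (a, b, se_b, t_b)."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two equal-length samples with at least 3 rows")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x has no variation")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual variance uses n - 2 degrees of freedom (intercept + slope).
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se_slope = math.sqrt(rss / (n - 2) / sxx)
    if se_slope == 0:
        # Perfect fit: the t-statistic is unbounded, keep the sign of the slope.
        t_slope = math.copysign(math.inf, slope)
    else:
        t_slope = slope / se_slope
    return intercept, slope, se_slope, t_slope

# Hypothetical toy data: an MTurk attractiveness score and a 0/1 release flag.
attractiveness = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
released = [0, 0, 1, 0, 1, 1]
intercept, slope, se_slope, t_slope = bivariate_ols(attractiveness, released)
```

The t-statistic on the slope is what the significance statements above refer to; the subsample variants (by gender or race) would simply filter the rows before calling the same function.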

  2. Skin-tone and race sanity check (See Link 1):

To make sure our skin-tone labels make sense, I plot their relationship to our race indicator. All seems to be in order here, with darker skin tones receiving higher proportions of the black label.

  3. Non-linearity (See Link 2):

The main point is that I found my mistake, and re-running the decile-average regression now yields significant results with a coefficient similar to the one seen previously. I also went ahead and ran a regression with higher-order terms, none of which appear significant.
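For reference, a minimal sketch of what "including higher-order terms" amounts to: the regressor is expanded into powers (x, x^2, ...) and the coefficients are fit by least squares. The helper names and toy data below are hypothetical, the fit uses plain normal equations rather than whatever the markdown uses, and standard errors are left out for brevity.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("design matrix is singular (collinear terms)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_polynomial(x, y, degree):
    """Least-squares fit of y on 1, x, ..., x^degree via the normal equations."""
    k = degree + 1
    design = [[xi ** p for p in range(k)] for xi in x]
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Hypothetical toy data: an outcome that is exactly linear in the decile (1-10).
deciles = list(range(1, 11))
outcome = [0.1 + 0.05 * d for d in deciles]
coefs = fit_polynomial(deciles, outcome, degree=2)  # [intercept, linear, quadratic]
```

On data that is truly linear, the quadratic coefficient comes out at essentially zero, which is the pattern an insignificant higher-order term would correspond to.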

Next Steps

  1. Re-configuring the CNN data-loader to adjust for the imbalance in our sample.
    • Need to adjust for the imbalance in our data as most people are released
    • Write a new sampler for the training data and ensure the validation data is unchanged
  2. Increase MTurk label collection with a larger number of workers per image
    • As we see some signal coming from the labels we want more detail on them
    • Increase the number of workers from 3 to 6 per image
  3. Re-run the baseline regressions with higher detail MTurk labels
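One possible shape for the new training sampler from step 1, sketched with the standard library only: each row is weighted by the inverse of its class frequency and training indices are drawn with replacement, while validation rows would still be read in their original order. The function names, the 90/10 split, and the seed are all hypothetical; in practice the same weights would be passed to whatever weighted sampler the CNN framework provides.

```python
import random
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each row by 1 / (count of its class) so classes get equal total mass."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]

def sample_balanced_indices(labels, num_samples, seed=0):
    """Draw training indices with replacement, proportional to the class weights."""
    rng = random.Random(seed)
    weights = inverse_frequency_weights(labels)
    return rng.choices(range(len(labels)), weights=weights, k=num_samples)

# Hypothetical toy labels: 1 = released (majority), 0 = not released (minority).
train_labels = [1] * 90 + [0] * 10
train_indices = sample_balanced_indices(train_labels, num_samples=2000)
sampled_share_released = sum(train_labels[i] for i in train_indices) / len(train_indices)
```

With these weights each class is drawn about half of the time, even though 90% of the underlying rows are released.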


Date: Tuesday, January 5, 2020

Topic: Regression Sanity Checks

Overview: An overview of what the markdown contains

This analysis focused on extending the primary regression of risk, skin_tone, age, MTurk labels, covariates and p_hat_cnn on release outcome. Below is an overview of the different elements:

  1. Per-Decile regressions:
  • Running our primary regression on subsamples of the risk_pred_prob distribution
  • Rows are grouped into decile 1-3, 4-6, 7-10
  2. Investigating non-linearity in p_hat_cnn:
  • Replacing p_hat_cnn by the per-decile mean
  • Replacing p_hat_cnn by the per-decile mean and collapsing decile 1-3
  • Replacing p_hat_cnn by a decile indicator (1-10)
  3. Repeating main regression on a subset of detailed MTurk labels:
  • Restricting our data to those with more than 3 MTurk labelers.
  • Between 6 and 9 workers per image, leaving 499 observations
  • Split further between male (N = 348) and female (N = 101)
  4. Investigating relationship between skin-tone, MTurk-features, and arrest-rate
  • We plot:
    • Mean arrest rate vs. skin-tone
    • Mean attractiveness vs. skin-tone
    • Mean competence vs. skin-tone
    • Mean dominance vs. skin-tone
    • Mean trustworthiness vs. skin-tone

Link: https://rpubs.com/JonasKnecht/updated_regressions

Summary: A summary of the conclusions drawn from the analysis

  1. Per-Decile Regressions

We primarily note that p_hat_cnn remains significant at all levels. Interestingly, the magnitude of the coefficient increases with the decile band of risk_pred_prob we subsample on. This would indicate that at higher underlying risk, as indicated by our xg_boost model, the CNN is able to account for more variance in the judges' decisions. The coefficient on p_hat_cnn increases from 0.317 at deciles 1-3, to 0.367 at deciles 4-6, to 0.514 at deciles 7-10. We also note similar increases in explanatory power when looking at the relative changes in adjusted R-squared.

  2. Non-linearity in p_hat_cnn

Here we were prompted to account for the non-linearity in p_hat_cnn as indicated by the decile plots (top of the markdown). Including the p_hat_decile_average suggests no significant effect, which is likely due to an error in the code and will be looked at further. Similarly, collapsing the bottom three deciles yields no significant results, and the simple 1-10 decile values also remain insignificant. Looking into this further is going to be a next step!
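To make the three variants concrete, here is a minimal sketch of how the decile indicator, the per-decile mean, and the collapsed bottom deciles could be constructed. This is an assumption about the construction, not the code from the markdown: deciles are assigned by rank (so tied values can land in adjacent deciles), and the function names and toy scores are hypothetical.

```python
from collections import defaultdict

def assign_deciles(values):
    """Return a decile (1-10) per value, based on its rank within the sample."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    deciles = [0] * n
    for rank, idx in enumerate(order):
        deciles[idx] = rank * 10 // n + 1
    return deciles

def group_means(values, groups):
    """Mean of `values` within each group label."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, group in zip(values, groups):
        sums[group] += value
        counts[group] += 1
    return {group: sums[group] / counts[group] for group in sums}

def decile_average_feature(values, collapse_bottom=0):
    """Replace each value by the mean of its decile.

    If collapse_bottom is k > 0, deciles 1..k are merged into one group first.
    Returns (group label per row, group mean per row), aligned with the input.
    """
    deciles = assign_deciles(values)
    if collapse_bottom:
        deciles = [1 if d <= collapse_bottom else d for d in deciles]
    means = group_means(values, deciles)
    return deciles, [means[d] for d in deciles]

# Hypothetical toy scores standing in for p_hat_cnn.
p_hat_cnn = [i / 20 for i in range(20)]
deciles, p_hat_decile_average = decile_average_feature(p_hat_cnn)
collapsed_deciles, collapsed_average = decile_average_feature(p_hat_cnn, collapse_bottom=3)
```

The `deciles` list is the raw 1-10 indicator, `p_hat_decile_average` is the per-decile mean aligned row by row with the original scores, and the collapsed version treats deciles 1-3 as a single group.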

  3. Repeating regressions with more MTurk detail:

Repeating our main regression on a subset of individuals with higher-accuracy MTurk labels (i.e. those with more workers-per-image) yields no significant results for any of our labels on a combined male + female data-set. The labels in question are: attractiveness, competence, dominance, and trustworthiness. These regressions include our MTurk labels together with our other arrest covariates, skin-tone, and p_hat_cnn. We also notice that p_hat_cnn is now insignificant, which is most likely due to the reduction in data.

Controlling for gender yields no significant results for males. For the female population, which includes only 101 observations, we notice that attractiveness and trustworthiness are now significant at the 5% and 10% levels respectively. This was intended as a proof of concept on whether it would be worth increasing the number of labelers per image, i.e. whether we are likely to see any significant signal coming from these features. The female subset suggests that this is indeed the case, and we go forward with future MTurk surveys increasing the number of workers per image from 3 to 6.

  4. Skin-tone plots:

Skin tone seems to have no direct relation to arrest-rate. Attractiveness, trustworthiness, and competence are highly correlated with skin tone, with lighter skin tones receiving higher scores on all three labels. Dominance is highly correlated in the opposite direction, with lighter skin receiving lower scores. This is in line with other literature findings.
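The quantities behind these plots can be sketched as a per-skin-tone mean of each label, followed by a correlation between the tone level and that mean. The coding of skin tone as integers with higher meaning darker is an assumption made for the example, and the function names and toy values are hypothetical.

```python
import math
from collections import defaultdict

def mean_by_group(groups, values):
    """Mean of `values` for each distinct group, keyed in sorted group order."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for group, value in zip(groups, values):
        sums[group] += value
        counts[group] += 1
    return {group: sums[group] / counts[group] for group in sorted(sums)}

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical toy data: skin tone coded 1 (lightest) to 3 (darkest).
skin_tone = [1, 1, 2, 2, 3, 3]
dominance = [2.0, 3.0, 3.0, 4.0, 4.0, 5.0]
mean_dominance = mean_by_group(skin_tone, dominance)
tones = list(mean_dominance)
r = pearson(tones, [mean_dominance[t] for t in tones])
```

Note that a correlation computed on group means, as here, will generally look stronger than one computed on the individual rows.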

Next-Steps: A summary of what we are investigating next

  1. Checking the relationship between our MTurk labels and release rate. We want this to conform to the literature which indicates that attractiveness should be significant.
  2. Set of bivariate regressions of the MTurk labels on release
  3. Include controls for race and gender
  4. Look again at the non-linearity in p_hat_cnn and make sure there are no coding mistakes
  5. Re-configuring the CNN data-loader to adjust for the imbalance in our sample.