Arrests Project Log

This file contains a log of the progress on the faces project. Each log-entry contains:

An overview of the files contents
Links to the result-markdowns
A summary of updates
Overview of next steps

Date: Monday, March 15, 2021

Topic: Arrest Table 01 - Side by Side MTurk labels

One Sentence Summary

We repeat our table 01 regressions with updated side-by-side MTurk labels and observe an increase in adjusted r-squared signal from 0.0002 to 0.0012. However looking at the actual regressions tables the OLS coefficients on the MTurk labels remain insignificant. Splitting by gender yields higher results for females at 0.0045 but there are still no significant coefficients.

Overview: An overview of what the markdown contains

A general overview:

Table 01 for the combined sample with new side by side MTurk labels
The corresponding OLS regressions for the combined sample
Table 01 split by gender with 6562 in the male and 1808 in the female subsample
The corresponding OLS regressions for the split sample

Link:

Table 01 with side-by-side labels: https://rpubs.com/JonasKnecht/st_table01_mturk_sbs

Next Steps

Good news … since I have a working PyTorch pipeline for getting gradients in the CNN, next up I am turning to the GAN and computing gradients there
Once gradients for both the CNN and the GAN are computed, I will turn to adding the step-update algorithm and generating image strips

Date: Monday, March 8, 2021

Topic: FULL VALIDATION GAN-CNN Projector Test

One Sentence Summary

I repeat our projector regressions with the full arrest validation set. We see that using the projected images vs. their real counterpart reduced the adjusted r-sqrt of p_hat_cnn from 0.033 to 0.0298 which is very promising

Overview: An overview of what the markdown contains

A general overview:

I now take our entire validation set of real mugshots, find their GAN projections, and compare the CNN predictions for both groups.

P_hat_cnn density plot comparison: I first get predictions for the validation set real and projected images and then compare their respective p_hat_cnn distributions. Ideally these would line up closely.
P_hat_cnn correlation: Here I take the pairs of p_hat_cnn values for real (i.e. target) and projected images and check for their correlation. We want to see as close to perfect correlation as possible, which would indicate that the GANs projector is keeping all the signal the CNN is picking up on.
Table 01: Producing table 01 with each column representing the p_hat_cnn values from either the projections, targets, or a combination of both.

Link:

GAN Projections P_hat_cnn comparison: https://rpubs.com/JonasKnecht/arrest_projector_full_val_regressions

Next Steps

Setting up GAN-CNN environment: This is a technical task and boils down to making sure we have an environment on AWS in which both the GAN and CNN predictor can run and share information
Build Prototype face-editing algorithm: I want to be able to compute gradients and propagate back to the GANs latent space.

Date: Sunday, February 24, 2021

Topic: EXPERIMENT on PyTorch Generator

Overview: An overview of what the markdown contains

A general overview:

I first check that the seed in the TensorFlow and PyTorch model are the same. I just do this by producing images from the same seed and checking them manually. This markdown focuses on the second check, looking at the p_hat_cnn distributions of generated vs. real images.

I plot three things:

The baseline distribution on p_hat_cnn of our validation set mugshots. This is the distribution all other ones are compared to as these are the real mugshots
The truncated at 0.5 distribution of p_hat_cnn. This is the distribution of a sample of 2,000 “average” mugshots from the GANs latent space (so these are NOT real images). Here I notice a significant skew
The truncated at 0.8 distribution of p_hat_cnn. This is the same as above, but for 2,000 more “extreme” images. The styleGAN paper is a little vague in how the discuss the impact of truncation, but I will do more digging to figure out exactly what \(\psi\) does.

Link:

GAN Generator Test: https://rpubs.com/JonasKnecht/pytorch_generator_distributions

Date: Wednesday, February 24, 2021

Topic: UPDATED GAN-CNN Projector Test

Overview: An overview of what the markdown contains

A general overview:

I now take 2000 real mugshots from our validation set, find their GAN projections, and compare the CNN predictions for both groups.

P_hat_cnn density plot comparison: I first get predictions for the 2k real and projected images and then compare their respective p_hat_cnn distributions. Ideally these would line up closely.
P_hat_cnn correlation: Here I take the pairs of p_hat_cnn values for real (i.e. target) and projected images and check for their correlation. We want to see as close to perfect correlation as possible, which would indicate that the GANs projector is keeping all the signal the CNN is picking up on.
Table 01: Producing table 01 with each column representing the p_hat_cnn values from either the projections, targets, or a combination of both.

Link:

GAN Projections P_hat_cnn comparison: https://rpubs.com/JonasKnecht/arrest_projector_regressions

Next Steps

P_hat_cnn for generated images: I want to check what the distribution of randomly generated GAN mugshots looks like through the CNN. Again, since we will want to try an editing pipeline without a projector we will want these generated images to follow the p_hat_cnn distribution of a real sample of mugshots as closely as possible.
PyTorch Generator Weight Tranfer: Since I have a working PyTorch generator port I will run some tests to make sure the Tensorflow and PyTorch model produce the same image output.

Date: Thursday, February 18, 2021

Topic: First GAN-CNN Predictions Sanity Checks

Overview: An overview of what the markdown contains

A general overview:*

I finally have a working CNN prediction script with which I can take GAN generated or projected images and get a prediction for our release outcome. I now take 2000 real mugshots, find their GAN projections, and compare the CNN predictions for both groups.

P_hat_cnn density plot comparison: I first get predictions for the 2k real and projected images and then compare their respective p_hat_cnn distributions. Ideally these would line up closely.
P_hat_cnn correlation: Here I take the pairs of p_hat_cnn values for real (i.e. target) and projected images and check for their correlation. We want to see as close to perfect correlation as possible, which would indicate that the GANs projector is keeping all the signal the CNN is picking up on.

NOTE Sorry that this took so long, I had to write a new custom prediction script and that turned out to be quite challenging actually. There are still some problems with gradient trakcing, but the general prediction pipeline is now up and running.

Link:

GAN Projections P_hat_cnn comparison: https://rpubs.com/JonasKnecht/arrest_gan_predictions_01

Next Steps

P_hat_cnn for generated images: I want to check what the distribution of randomly generated GAN mugshots looks like through the CNN. Again, since we will want to try an editing pipeline without a projector we will want these generated images to follow the p_hat_cnn distribution of a real sample of mugshots as closely as possible.

Date: Friday, February 5, 2021

Topic: CNN Comparisons

Overview: An overview of what the markdown contains

This update contains links to the baseline regressions from before, with an added section for definitions, as well as a link to two different configurations of table 01.

P-hat-cnn distribution comparison: The first plot compares the p_hat_cnn distribution from different models, these are:

Baselin CNN: The ResNet-50 initially trained by Logan and Celia
Minority Class Sampling CNN: The ResNet-50 with the custom data sampler to deal with the imbalance in our training data
The Top 3 performing MNV-2 class models trained by Jim. These are labeld top, second, third according to their AUC performance.

Table 1 Configuration 1: I repeat our table 01 config 01 regressions using the p_hat_cnn values from all five different CNNs as well as an ensemble model.

Single Variable Model:

Demographics on release outcome
Arrest charge features on release outcome
XgBoost Predicted risk on release outcome
MTurk labels on release outcome including the mean and median
p_hat_cnn on release outcome

Multiple Variable Model:

Demographics + Charge Features in a combined model
Demographics + Charge Features in a combined model + predicted risk
Demographics + Charge Features + Risk + MTurk features (mean, median)
Demographics + Charge Features + Risk + p_hat_cnn
Kitchen Sink Model including: Demographics + Charge Features + Risk + p_hat_cnn + MTurk features

NOTE Please see the definition section at the top of the results output, together with the definitions above every table, for an overview of what RHS variables I am using.

Link:

CNN Comparison: https://rpubs.com/JonasKnecht/arrest_cnn_comparison

Next Steps

Re-Desinging MTurk Survey to fit the side-by-side comparison. For this we need to do some digging to allow us to group images by quality.

Date: Wednesday, February 3, 2021

Topic: TABLE 01 - Configuration 1 and 2 - Gender Splits & MTurk Moments

Overview: An overview of what the markdown contains

This update contains links to the baseline regressions from before, with an added section for definitions, as well as a link to two different configurations of table 01.

Table 1 Configuration 1: This table is based on the agreed upon outline. As such it displays the adjusted R^2 from regressions of individual variables and combined models on the final release outcome. The models are:

Single Variable Model:

Demographics on release outcome
Arrest charge features on release outcome
XgBoost Predicted risk on release outcome
MTurk labels on release outcome now including the mean and median
p_hat_cnn on release outcome

Multiple Variable Model:

Demographics + Charge Features in a combined model
Demographics + Charge Features in a combined model + predicted risk
Demographics + Charge Features + Risk + MTurk features (mean, median)
Demographics + Charge Features + Risk + p_hat_cnn
Demographics + Charge Features + Risk + p_hat_cnn + MTurk features

Table 1 Configuration 1: I now split this by gender. We finally see a signficant MTurk effect for this. The specifications are the same as above.
Table 1 Configuration 2: I don’t change any of the baseline models but include an additional row for additional MTurk moments, in the same way as above
Table 1 Configuration 1 Gender Interaction: I repeat Table 01 Config 01 and include an interaction term sex*MTurk on top of our baseline demographics model. We see a marginal increase from the MTurk labels.

NOTE Please see the definition section at the top of the results output, together with the definitions above every table, for an overview of what RHS variables I am using.

Link:

Table 01 configurations: https://rpubs.com/JonasKnecht/arrest_table01_edits

Next Steps

Different CNN predictions

Since Jim is finishing his CNN this week, I will re-run our regressions with his predictions
His work produced a much wider spread in p_hat_cnn which will be good to compare to the results above
I’m especially interested in seeing how his distribution compares to ours.

Re-Desinging MTurk Survey to fit the side-by-side comparison. For this we need to do some digging to allow us to group images by quality.

Date: Wednesday, January 20, 2021

Topic: TABLE 01 - Configuration 1 and 2

Overview: An overview of what the markdown contains

This update contains links to the baseline regressions from before, with an added section for definitions, as well as a link to two different configurations of table 01.

Table 1 Configuration 1: This table is based on the agreed upon outline. As such it displays the adjusted R^2 from regressions of individual variables and combined models on the final release outcome. The models are:

Single Variable Model:

Demographics on release outcome
Arrest charge features on release outcome
XgBoost Predicted risk on release outcome
MTurk labels on release outcome
p_hat_cnn on release outcome

Multiple Variable Model:

Demographics + Charge Features in a combined model
Demographics + Charge Features in a combined model + predicted risk
Demographics + Charge Features + Risk + MTurk features
Demographics + Charge Features + Risk + p_hat_cnn
Demographics + Charge Features + Risk + p_hat_cnn + MTurk features

Table 1 Configuration 2: This table is slightly different in that it displays the individual models incrementally with the addition of p_hat_cnn. The columns then include the two other major predictors: predicted risk and charge characteristics. The 95% C.I. is included (though arguably not in the best format). These were obtained via Bootstrapping.

Link:

Baseline Regression Summary: https://rpubs.com/JonasKnecht/regression_summary
Robustness Summary: https://rpubs.com/JonasKnecht/robustness_summary
Table 01 configurations: https://rpubs.com/JonasKnecht/table01

Next Steps

Jim’s CNN predictions

Since Jim is finishing his CNN this week, I will re-run our regressions with his predictions
His work produced a much wider spread in p_hat_cnn which will be good to compare to the results above
I’m especially interested in seeing how his distribution compares to ours.

Election CNN regressions & MTurk labels

Once the new MTurk labels for the elections data are ready I will re-run all our primary regressions for the election CNN

Date: Tuesday, January 19, 2021

Topic: SUMMARY of baseline regressions and robustness checks

Overview: An overview of what the markdown contains

With these two markdowns I have split the output for our baseline regressions and our robustness analysis. This should hopefully make it easier to piece together what is important. The important headings are also in red. Here is an outline of their contents:

Link 1 - Regression Summary
- Baseline regression with the standard model
- Split by gender
- Baseline regression with the updated model (i.e. the new sampler)
- Distribution plots for the new model
- Split by gender

Link 2 - Robustness Summary
- MTurk label regressions
  - Combined and gender split
  - Split by race (B vs. nB)
- Non-linearity in p-hat-cnn
  - Decile plots
  - Regression with the average decile values
  - Direct coding of the deciles
  - Higher order terms
- Bivariate skin-tone regressions
  - Combined male and female regression (Mturk skin-tone + categories)
  - Gender split (not the signficant effect only for females)
  - Coefficient plots
- Skin-tone sanity checks
  - Plotting skin tone and race labels
  - Skin tone and arrest rate
  - Skin tone and other MTurk labels

Link:

Baseline Regression Summary: https://rpubs.com/JonasKnecht/regression_summary
Robustness Summary: https://rpubs.com/JonasKnecht/robustness_summary

Next Steps

Jim’s CNN predictions

Since Jim is finishing his CNN this week, I will re-run our regressions with his predictions
His work produced a much wider spread in p_hat_cnn which will be good to compare to the results above
I’m especially interested in seeing how his distribution compares to ours.

Election CNN regressions & MTurk labels

Once the new MTurk labels for the elections data are ready I will re-run all our primary regressions for the election CNN

Date: Friday, January 15, 2021

Topic: NEW CNN with a minority-class-sampler in the DataLoader

Overview: An overview of what the markdown contains

Prediction Distribution & Deciles: To check whether the new data-loader increases the spread of predictions from the CNN, I plot the new p_hat_cnn vs. the old models predictions. Since increased spead should help with the linearity in the decile plots, I present a comparison of those as well.

Histogram of p_hat_cnn overlapping the new and old models
Decile plots for the new and old model of p_hat_cnn vs. both the decile index and the respective mean-decile-value

Baseline Regressions: I repeat all our baseline regressions with the new CNN data to make sure our coefficient estimated haven’t changed much. Here I present:

Our baseline regression of all covariates + p_hat_cnn on the release outcome
Split by gender

Non-Linearity Check: Given the changes in the decile plots I repeat our primary non-linearity regressions.

Link:

Minority Sampler CNN Regressions: https://rpubs.com/JonasKnecht/minority_sampler_regressions

Summary

Prediction Distribution & Deciles: Here we can clearly see a massive improvement over the older model, with much more weight being given to the LHS of the original p_hat_cnn distribution. This would indicate that the new minority_class_sampler function works well. Notice that this new CNN was trained with the scaling parameter (which I coded to be the target ratio of majority:minority) set to 2, from the original 3.6 observed in the raw data (i.e. for every personal actually jailed, there are approx. 3.6 not). The distribution plot is the most striking evidence that the new data-loader worked as intended (Eureka !!). The decile plots tell a similar story, in that the new model is strikingly linear and as Decile Plot A - NEW MODEL indicates, the gap between successive deciles is much reduced.
Baseline Regressions: Repeating our baseline regressions with these new CNN predictions doesn’t change the significance of our effect. We do however see a decrease in the coefficient value on p_hat_cnn from 0.403 (in our last baseline) to 0.380. Interestingly, the gap in adj-R-squared actually increases in the new model from 0.112 --> 0.123. The same holds true for the female/male split, with the female coefficient still being greater than the male coefficient. Both are reduced w.r.t. our last baseline. All in all, with the adjustment for the spread of p_hat_cnn we seen to have reduced our model coefficient across the board (given that the linear model is now a much better fit this is not surprising). Note This new data, and the reduction in regression coefficients, is entirely in-line with what we observed in the control for non-linearity !
Non-linearity check: Repeating the non-linearity regressions we notice a much smaller coefficient reduction than seen with our previous CNN data. This links back to the summary above.

Next Steps

Jim’s CNN predictions

Since Jim is finishing his CNN this week, I will re-run our regressions with his predictions
His work produced a much wider spread in p_hat_cnn which will be good to compare to the results above
I’m especially interested in seeing how his distribution compares to ours.

Election CNN regressions & MTurk labels

Once the new MTurk labels for the elections data are ready I will re-run all our primary regressions for the election CNN

Date: Wednesday, January 13, 2021

Topic: Repeating primary regressions with increased MTurk detail

Overview: An overview of what the markdown contains

After collecting at least 6 labels for each of our MTurk features we re-run our baseline regression on release outcome to see whether we have more signficant results.
- Complete male+female sample with the inclusion of 18 skin-tones, age, MTurk labels, p_hat_cov, and p_hat_cnn.
- Split by gender
Along the same logic as above, we repeat the MTurk label regressions to see whether effects are more significant
- We run each label individually and combined
- Split by gender
- Split by race

Link:

Bivariate MTurk Skin-tone Labels: https://rpubs.com/JonasKnecht/mturk_label_detail

Summary

We notice no changes in the overall regression significance of our baseline regression. The 4 MTurk labels are still all insignificant. The coefficient on p_hat_cnn decreases from 0.415 to 0.403 and remains very significant. Splitting by gender reveals that for the female subsample the dominance label becomes significant at the 5 and 10% level. The male subsample is entirely insignificant.
The MTurk label regressions are fully insignificant on the entire male+female sample. Neither the individual labels nor the combined labels are significant at all. On the female subsample we see that all 4 labels are significant. (This is pretty big news !!!) All four labels display positive coefficients and attractiveness is significant at the 1% level. Dominance and Trustworthiness are both significant at the 5% level and Competenceat the 10% level. The male subsample has no significant coefficients. Splitting by race shows no significant coefficients for the black subsample. For non-black individuals we see that trsutworthiness is slighly significant at the 10% level.

Next Steps

Focus on the new sampling data-loader for the CNN

Since we have an imbalanced dataset with about 3.67 times as many people in the released category, we want to test a new CNN dataloader
With a new sampling function for the training data we hope to balance the training process, by over sampling from our minority class

Jim’s CNN predictions

Since Jim is finishing his CNN this week, I will re-run our regressions with his predictions
His work produced a much wider spread in p_hat_cnn which may be a good indication of what we can achieve with a re-written dataloader

Date: Tuesday, January 12, 2021

Topic: Regression of MTurk skin-tone and race labels

Overview: An overview of what the markdown contains

A series of regression of collected skin-tone labels on our final arrest outcome. We want to investigate the relationship between race and release, and make sure we are not missing potentially worrying discrepancies.

We run the following regression on the final arrest outcome:

- Individual skin-tone effects 
- Skin-tone divided into three categories (dark, medium, light)
- MTurk race label

Each of these includes:

- Control for gender (male/female split)
- Coefficient plots for each regression

Link:

Bivariate MTurk Skin-tone Labels: https://rpubs.com/JonasKnecht/skin-tone-regressions

Summary

We find no significant skin-tone effects for the combined sample or for the male/female subsample
Combining skin-tone into three categories yields a significant negative effect for white females

This is the only significant effect among these regressions
We observe a -0.058 change in the arrest outcome for light females over dark females
This is significant at the 1% level

Regressing on the race label yields no significant effect
There is no noticeable pattern in the coefficient plots for the skin-tone regressions

Next Steps

Focus on the new sampling data-loader for the CNN

Since we have an imbalanced dataset with about 3.67 times as many people in the released category, we want to test a new CNN dataloader
With a new sampling function for the training data we hope to balance the training process, by over sampling from our minority class

MTurk label regression with more detail

We are actively collecting the missing label detail with further MTurk surveys and will re-run all our primary regressions
The hope is that we are able to pick up significant effects for labels such as attractivenss which literature suggests should play a role

Jim’s CNN predictions

Since Jim is finishing his CNN this week, I will re-run our regressions with his predictions
His work produced a much wider spread in p_hat_cnn which may be a good indication of what we can achieve with a re-written dataloader

Date: Thursday, January 7, 2021

Topic: Bivariate Regression of MTurk Features and Non-Linearity in p_hat_cnn

Overview: An overview of what the markdown contains

A set of bivariate regressions checking that our MTurk features correlate with arrest-outcome as suggested by the literature

Combined male+female regression of attractiveness, competence, dominance, and trustworthiness
- Each label is regressed individually and once combined
Subsample by gender
Subsample by ethnicity (black vs. non-black)
Subsample on higher detailed labels
(In Appendix) Subsample of females with race-controll (as this is the only one signficant)
(In Appendix) Sanity check for skin-tone relationship to race label

Checking non-linearity in p_hat_cnn

Fixing coding error with p_hat_cnn_decile_average
Raw coding integers 1-10 for decile
Including higher order terms

Link: Links to two separate markdowns for

Bivariate MTurk Labels: https://rpubs.com/JonasKnecht/bivariate_mturk_regressions
Non-linearity in p_hat_cnn: https://rpubs.com/JonasKnecht/non_linearity_updated

Note I kept these separate just because one is a simple sanity check and the other has potentially more next steps and I want to cut down on the length of these output files to make them a little more comprehensible.

Summary

Bivariate Regressions (See Link 1):

Combined male+female regressions still yield no significant results for any individual MTurk label or the combined model on release. The same holds for the male subset. For females we see that attractiveness, dominance, and trustworthiness are significant on their own, but not when combined. This is in-line with what we observed in the previous log. All effects are insignificant for the black vs. non-black subsets. However, controlling for race, the female subset effects become more significant with similar coefficient magnitudes.

Repeating these regressions for the well-labeled subset results in no significant effects for females. This may well be due to the lack of data, as we are restricting to 101 observations for the female subset. However, as this had been significant in the previous log, with the inclusion of all other regression covariates, I am a little puzzled by this ! The combined male+female model sees significant effects for competence and trustworthiness on the well labeled subset. We notice the same significance with the black-subset on the well-labeled set.

All in all these regressions are a bit puzzling, but seem to suggest that there is some signal stemming from these labels.

Skin-tone and race sanity check (See Link 1):

To make sure our skin-tone labels make sense I plot their respective relationship to our race indicator. All seems to be in order here with darker skin tones receiving higher proportions of the black label.

Non-linearity (See Link 2):

The main point is that I found my mistake and re-running the decile-average regression now yields significant results with a coefficient similar to the one seen previously. I also went ahead and ran a regression with higher order terms, none of which appear significant.

Next Steps

Re-configuring the CNN data-loader to adjust for the imbalance in our sample.
- Need to adjust for the imbalance in our data as most people are released
- Write a new sampler for the training data and ensure the validation data is unchanged
Increase MTurk label collection with larger number of workers per image
- As we see some signal coming from the labels we want more detail on them
- Increase the number of workers from 3 to 6 per image
Re-run the baseline regressions with higher detail MTurk labels

Date: Tuesday, January 5, 2021

Topic: Regression Sanity Checks

Overview: An overview of what the markdown contains

This analysis focused on extending the primary regression of risk, skin_tone, age, MTurk labels, covariates and p_hat_cnn on release outcome. Below is an overview of the different elements:

Per-Decile regressions:

Running our primary regression on subsamples of the risk_pred_prob distribution
Rows are grouped into decile 1-3, 4-6, 7-10

Investigating non-linearity in p_hat_cnn:

Replacing p_hat_cnn by the per-decile mean
Replacing p_hat_cnn by the per-decile mean and collapsing decile 1-3
Replacing p_hat_cnn by a decile indicator (1-10)

Repeating main regression on a subset of detailed MTurk labels:

Restricting our data to those with more than 3 MTurk labelers.
Between 6-9 workers per image leaving 499 observations
Split further between male (_N = 348) and female (_N = 101)

Investigating relationship between skin-tone, MTurk-features, and arrest-rate

We plot:
- Mean arrest rate vs. skin-tone
- Mean attractiveness vs. skin-tone
- Mean competence vs. skin-tone
- Mean dominance vs. skin-tone
- Mean trustworthiness vs. skin-tone

Link: https://rpubs.com/JonasKnecht/updated_regressions

Summary: A summary of the conclusions drawn from the analysis

Per-Decile Regressions

We primarily note that p_hat_cnn remains significant at all levels. Interestingly the magnitude of the coefficient increases with the decile of risk_pred_prob we control for. This would indicate that at higher underlying risk, as indicated by our xg_boost model, the CNN is able to account for more variance in the judges decision. The coefficient on p_hat_cnn increases from: 0.317 at deciles 1-3, to 0.367 at deciles 4-6, to 0.514 at deciles 7-10. We also note similar increases in explanatory power when looking at the relative changes in adjusted R-squared.

Non-linearity in p_hat_cnn

Here we were prompted to account for the non-linearity in p_hat_cnn as indicated by the decile plots (top of the markdown). Including the p_hat_decile_average suggests no significant effect, something which is likely an error in the code and will be looked at further. Similarly collapsing the bottom three deciles yields no results, and the inclusion of simple 1-10 values also remains insignificant. Looking into this further is going to be a next step !

Repeating regressions with more MTurk detail:

Repeating our main regression on a subset of individuals with higher-accuracy MTurk labels (i.e. those with more workers-per-image) yields no significant results for any of our labels on a combined male + female data-set. The labels in question are: attractiveness, competence, dominance, and trustworthiness. These regressions include our MTurk labels together with our other arrest covariates, skin-tone, and p_hat_cnn. We also notice that p_hat_cnn is now insignificant, which is most likely due to the reduction in data.

Controlling for gender yields no significant results for males. For the female population, which includes only 101 observations, we notice that attractiveness and trustrworthiness are now significant at the 5 and 10 % level respectively. This was intended as a proof of concept on whether it would be worth increasing the number of labelers per image, i.e. whether we are likely to see any significant signal coming from these features. The female subset suggests that this is indeed the case and we go forward with future MTurk surveys increasing the number of workers per image from 3 to 6.

Skin-tone plots:

Skin tone seems to have no direct relation to arrest-rate. Attractiveness, trustworthiness, and competence are highly positively correlated, with lighter skin tones receiving higher scores on both labels. Dominance is highly negatively correlated to skin-tone with lighter skin receiving lower scores. This is in-line with other literature findings.

Next-Steps: A summary of what we are investigating next

Checking the relationship between our MTurk labels and release rate. We want this to conform to the literature which indicates that attractiveness should be significant.
Set of bivariate regressions of the MTurk labels on release
Include controls for race and gender
Look again at the non-linearity in p_hat_cnn and make sure there are no coding mistakes
Re-configuring the CNN data-loader to adjust for the imbalance in our sample.