Seth J. Chandler
August 28, 2014
This problem introduces students to some basic R functionality and, as a bonus, explores the interesting differences in behavior among various immigration judges in the United States. As preparation for the assignment, students were asked to import a web-based csv file on the behavior of immigration judges in the United States, make some minor modifications in Google Spreadsheets, and save the document as a csv file somewhere they could access it. If you want to explore the materials discussed in this report without going through the Google Spreadsheets process, you can get the material here.1
Our first task is to load in data you previously created in Google Spreadsheets and downloaded to your hard drive as a CSV file. To bring this data into R, the appropriate command is read.csv. I print out the “head” of the resulting data.frame.2
csvfile<-"~/Dropbox/Courses/Analytic Methods/Problem Sets/Immigration Judge Data Augmented R.csv"
Immigration.Judge.Data<-read.csv(csvfile)
head(Immigration.Judge.Data)
## X Court Judge Decisions Percent.Grants Percent.Denials
## 1 1 Adelanto Lee, Amy T. 120 7.5 92.5
## 2 2 Adelanto Burke, David H. 197 24.4 75.6
## 3 3 Adelanto Laurent, Scott D. 147 27.2 72.8
## 4 4 Arlington Harris, Rodger C. 131 14.5 85.5
## 5 5 Arlington Crosland, David W. 212 73.1 26.9
## 6 6 Arlington Iskra, Wayne R. 729 73.5 26.5
## Grants Denials
## 1 9.00 111.00
## 2 48.07 148.93
## 3 39.98 107.02
## 4 19.00 112.00
## 5 154.97 57.03
## 6 535.82 193.19
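For readers who do not have the CSV file on disk, here is a minimal sketch that rebuilds a toy version of the data from the six rows printed above. The name `toy` and the inline CSV text are assumptions made for illustration, and the count columns are recomputed from the percentages, which is presumably how the fractional values arose.

```r
# Toy CSV text: the six rows shown by head() above, retyped by hand.
# Judge names contain commas, so they must be quoted.
csv_text <- 'Court,Judge,Decisions,Percent.Grants,Percent.Denials
Adelanto,"Lee, Amy T.",120,7.5,92.5
Adelanto,"Burke, David H.",197,24.4,75.6
Adelanto,"Laurent, Scott D.",147,27.2,72.8
Arlington,"Harris, Rodger C.",131,14.5,85.5
Arlington,"Crosland, David W.",212,73.1,26.9
Arlington,"Iskra, Wayne R.",729,73.5,26.5'

# read.csv accepts literal text through the text= argument.
toy <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Recompute counts as rate times decisions (hence the non-integer values).
toy$Grants  <- toy$Percent.Grants  / 100 * toy$Decisions
toy$Denials <- toy$Percent.Denials / 100 * toy$Decisions

str(toy)          # column types at a glance
head(toy, n = 3)  # head takes an optional row count
```

Passing `stringsAsFactors = FALSE` keeps the court and judge names as plain character strings, which tends to make the filtering steps that follow easier to reason about.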
Using as much elegance as you can, have R compute the aggregate grant and denial rates for Immigration Judges in Houston. (Assume Houston-detained is a different location than Houston).
I am asking you here to use the filtering capabilities of R. The first thing I want to do is create a mini-spreadsheet that only contains the Houston rows and only contains the Decisions, Grants and Denials columns.
print(houston<-
Immigration.Judge.Data[Immigration.Judge.Data$Court=="Houston",
c("Decisions","Grants","Denials")]
)
## Decisions Grants Denials
## 82 111 13.99 97.01
## 83 123 29.03 93.97
## 84 204 57.94 146.06
## 85 188 53.96 134.04
## 86 275 86.08 188.93
## 87 217 92.01 124.99
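The bracket-style filter above has a close cousin in `subset`. The sketch below compares the two on a toy data frame; the "Houston - Detained" row is made up (including its label) purely to show that `==` only matches the exact string "Houston".

```r
# Toy data: two Houston rows and one Arlington row retyped from the output
# above, plus one HYPOTHETICAL "Houston - Detained" row with invented numbers.
toy <- data.frame(
  Court     = c("Houston", "Houston", "Houston - Detained", "Arlington"),
  Decisions = c(111, 123, 150, 131),
  Grants    = c(13.99, 29.03, 10.00, 19.00),
  Denials   = c(97.01, 93.97, 140.00, 112.00),
  stringsAsFactors = FALSE
)

# Bracket indexing: a logical row filter, then a character vector of columns.
by_index <- toy[toy$Court == "Houston", c("Decisions", "Grants", "Denials")]

# subset(): the same filter, with column names written unquoted.
by_subset <- subset(toy, Court == "Houston",
                    select = c(Decisions, Grants, Denials))

by_index
all.equal(by_index, by_subset)  # the two approaches should agree
```

Bracket indexing is usually the safer choice inside functions, while `subset` reads more naturally in interactive work.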
Notice that because I was (deliberately) sloppy when I created my Google Spreadsheet, the grants and denials are not integers. Obviously, however, there are no fractions of judicial grants and denials of asylum. The fractions shown here are an artifact of multiplying previously rounded numbers together in Google Spreadsheets.3 So, I think it would be a good idea to use R to round the “Grants” and “Denials” columns. The command that is perhaps most helpful is lapply, which applies a function to each element of a list. Remember that a data.frame in R is basically a list of columns. To apply the rounding just to the columns that need it, I use c("Grants","Denials") on the left-hand side of the assignment and the same filtering construct on the right-hand side, so that round is applied to just those two columns.
houston[,c("Grants","Denials")]<-lapply(houston[,c("Grants","Denials")],round)
houston
## Decisions Grants Denials
## 82 111 14 97
## 83 123 29 94
## 84 204 58 146
## 85 188 54 134
## 86 275 86 189
## 87 217 92 125
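To make the "data.frame is a list of columns" idea concrete, here is a small sketch using the first three Houston rows. The names `houston_toy`, `cols` and `rounded` are assumptions for illustration.

```r
# First three Houston rows, retyped from the unrounded output above.
houston_toy <- data.frame(
  Decisions = c(111, 123, 204),
  Grants    = c(13.99, 29.03, 57.94),
  Denials   = c(97.01, 93.97, 146.06)
)

is.list(houston_toy)   # a data.frame really is a list ...
names(houston_toy)     # ... whose elements are the columns

cols <- c("Grants", "Denials")

# lapply visits each selected column and returns a plain list of results.
rounded <- lapply(houston_toy[cols], round)
class(rounded)

# Assigning the list back replaces the columns one by one.
houston_toy[cols] <- rounded

# Sanity check: rounded grants plus denials still add up to decisions.
houston_toy$Grants + houston_toy$Denials == houston_toy$Decisions
```

The final check matters because rounding two pieces separately does not always preserve their total, although it does for these rows.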
If this were a spreadsheet, you would probably take the sum of each column and then do some division. We can do the same thing in R. Once again lapply is the helpful command: because a data.frame is basically a list of columns, applying sum to each element gives the column totals.
print(houston.rates<-lapply(houston,sum))
## $Decisions
## [1] 1118
##
## $Grants
## [1] 333
##
## $Denials
## [1] 785
To get the aggregate grant and denial rates, we have a simple division problem. I use c to combine the numerators into a little vector (c(houston.rates$Grants,houston.rates$Denials)) and then divide that vector by houston.rates$Decisions. I round the resulting quotient to three decimal places.
round(c(houston.rates$Grants,houston.rates$Denials)/houston.rates$Decisions,3)
## [1] 0.298 0.702
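One alternative sketch of the same calculation uses `colSums`, which returns a named vector rather than a list. It also illustrates why the aggregate rate is not the same thing as the average of the individual judges' rates. The data frame below is retyped from the rounded Houston output, and the variable names are assumptions.

```r
# The six rounded Houston rows from the output above.
houston_toy <- data.frame(
  Decisions = c(111, 123, 204, 188, 275, 217),
  Grants    = c(14, 29, 58, 54, 86, 92),
  Denials   = c(97, 94, 146, 134, 189, 125)
)

# colSums gives a named numeric vector of column totals.
totals <- colSums(houston_toy)

# Aggregate rates: total grants and total denials over total decisions.
aggregate_rates <- round(totals[c("Grants", "Denials")] / totals["Decisions"], 3)
aggregate_rates

# Averaging per-judge rates weights every judge equally, regardless of
# caseload, so it generally gives a different answer.
mean_of_rates <- mean(houston_toy$Grants / houston_toy$Decisions)
mean_of_rates
```

The aggregate rate answers "what share of Houston decisions were grants", while the mean of rates answers "what does the typical Houston judge do"; which one is wanted depends on the question.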
Using as much elegance as you can, have R find the Immigration Judge with at least 200 decisions who had the highest grant rate. (Notice anything about these judges?) Have R find the Immigration Judge with the highest denial rate. (Notice anything about the top three judges?)
The first thing I want to do is find the judges with at least 200 decisions. There are a lot of them so I will just print out the first couple using head.
head(judges.200plus<-Immigration.Judge.Data[Immigration.Judge.Data$Decisions>=200,])
## X Court Judge Decisions Percent.Grants
## 5 5 Arlington Crosland, David W. 212 73.1
## 6 6 Arlington Iskra, Wayne R. 729 73.5
## 7 7 Arlington Burman, Lawrence O. 654 74.0
## 8 8 Arlington Snow, Thomas G. 825 75.6
## 10 10 Arlington Bryant, John Milo 538 81.4
## 11 11 Arlington Schmidt, Paul W. 446 83.2
## Percent.Denials Grants Denials
## 5 26.9 155.0 57.03
## 6 26.5 535.8 193.19
## 7 26.0 484.0 170.04
## 8 24.4 623.7 201.30
## 10 18.6 437.9 100.07
## 11 16.8 371.1 74.93
I now want to sort the judges high to low by grant rates. To do this, I use the R order command. The idea of order is that it takes some data, here the grant rate, and returns a vector of integers. The integers represent positions in the original data. If one picked those positions in order, one would obtain a sorted vector.
Note that I stick a minus sign in front of the data. This tells order that I want the result sorted from highest to lowest and not lowest to highest.
Note also that I use a second argument to head so that I can see the positions of the top 20 judges. There seems to be something “funny” going on in that there appears to be a cluster of judges numbered 140 to about 123 who have very high grant rates. This is an artifact of the fact that the immigration judge data was previously sorted by geographic region and that geography apparently makes a big difference in grant rates.
head(judges.200plus.orderingByGrants<-order(-judges.200plus$Percent.Grants),n=20)
## [1] 140 139 138 137 136 135 133 134 132 131 129 130 128 6 127 5 126
## [18] 124 125 123
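Because order returns positions rather than values, it can be easier to follow on a tiny example. The sketch below uses four grant rates taken from the tables in this report; the letter labels are assumptions added only to make the result readable.

```r
# Four grant rates with hypothetical labels A-D.
rates <- c(A = 73.1, B = 96.1, C = 14.5, D = 83.2)

# Positions of the values from highest to lowest.
pos <- order(-rates)
pos

# Indexing by those positions yields the sorted data.
rates[pos]

# The minus-sign trick matches the decreasing= argument for numeric data.
identical(pos, order(rates, decreasing = TRUE))
```

The `decreasing = TRUE` form also works for character data, where a minus sign would fail.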
Who are these judges? Let’s look at the top 20.
judges.200plus[judges.200plus.orderingByGrants[1:20],
c("Court","Judge","Decisions","Percent.Grants")]
## Court Judge Decisions Percent.Grants
## 196 New York Lamb, Elizabeth A. 1346 96.1
## 195 New York Bain, Terry A. 1775 93.9
## 194 New York Brennan, Noel A. 2029 91.7
## 193 New York Bukszpan, Joanna M. 1039 91.0
## 192 New York McManus, Margaret 1831 89.1
## 191 New York Loprest, F. James, Jr. 807 88.4
## 189 New York Laforest, Brigitte 1807 86.5
## 190 New York Mulligan, Thomas J. 1426 86.5
## 188 New York Gordon-Uruakpa, Vivienne E. 1546 86.2
## 187 New York Morace, Philip L. 2322 85.9
## 185 New York Chew, George T. 1860 84.9
## 186 New York Sichel, Helen J. 1633 84.9
## 184 New York Schoppert, Douglas B. 1490 84.6
## 11 Arlington Schmidt, Paul W. 446 83.2
## 183 New York Van Wyke, William Van 1309 82.9
## 10 Arlington Bryant, John Milo 538 81.4
## 182 New York Segal, Alice 886 81.0
## 180 New York Weisel, Robert D. 989 80.8
## 181 New York Zagzoug, Randa 750 80.8
## 179 New York Rohan, Patricia A. 1558 79.2
Hmm. It appears that the top granters are almost all in New York, with a few in Arlington, Virginia. This is an interesting finding in and of itself. Now, it could be that the facts of the cases in New York are materially different from those elsewhere, but the data is certainly worthy of further investigation.
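Rather than eyeballing the court column, one could have R count the courts. The sketch below retypes the Court column of the top-20 table by hand into a vector called `top20_courts` (an assumed name) and tabulates it.

```r
# Court column of the top-20 table above, in the order printed.
top20_courts <- c(rep("New York", 13), "Arlington", "New York",
                  "Arlington", rep("New York", 4))

# table counts occurrences; sort puts the most frequent court first.
court_counts <- sort(table(top20_courts), decreasing = TRUE)
court_counts
```

On the full data the same idea would presumably be `table(judges.200plus[judges.200plus.orderingByGrants[1:20], "Court"])`, with the caveat that if Court was read in as a factor, courts with zero judges in the top 20 would also appear in the table.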
And what about the judges with the highest denial rates? Is there any pattern? Let’s whip up some R code to find out.
judges.200plus[order(-judges.200plus$Percent.Denials)[1:20],
c("Court","Judge","Decisions","Percent.Grants")]
## Court Judge Decisions Percent.Grants
## 160 Miami - Krome Opaciuch, John 200 0.5
## 161 Miami - Krome Opaciuch, Adam 329 1.5
## 162 Miami - Krome Hurewitz, Kenneth S. 305 2.3
## 241 San Francisco Murry, Anthony S. 410 2.7
## 98 Los Angeles Munoz, Lorraine J. 643 6.2
## 12 Atlanta Wilson, Earle B. 335 6.3
## 242 San Francisco Yamaguchi, Michael J. 244 6.6
## 73 Florence Taylor, Bruce A. 251 7.6
## 66 El Paso Abbott, William L. 272 7.7
## 163 Miami - Krome Slavin, Denise N. 219 7.8
## 164 Miami - Krome Ford, Rex J. 291 7.9
## 226 San Antonio Burkholder, Gary D. 287 8.4
## 23 Bloomington Nickerson, William J., Jr. 279 10.0
## 200 Newark Reichenberg, Margaret R. 359 10.6
## 207 Omaha Morris, Daniel A. 442 10.6
## 48 Cleveland Evans, D. William, Jr. 594 12.6
## 99 Los Angeles Riley, Kevin W. 444 12.6
## 37 Charlotte Pettinato, Barry J. 315 12.7
## 91 Imperial Staton, Jack W. 305 14.1
## 94 Las Vegas Romig, Jeffrey L. 308 15.6
Clearly, if you’re seeking asylum in the United States, Miami should not be your first destination. Again, conceivably this is because the facts of cases in Miami differ systematically from those brought elsewhere in the United States, but the high rate of denials seems rather curious.
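A natural follow-up to these court-by-court impressions is to compute grant rates per court. Here is a minimal sketch using `aggregate` on the six rows from the head output near the top of this report, with counts rounded; `toy` and `by_court` are assumed names, and six rows are of course far too few to support any conclusion.

```r
# Six rows retyped from the head() output, with grants rounded to integers.
toy <- data.frame(
  Court     = c("Adelanto", "Adelanto", "Adelanto",
                "Arlington", "Arlington", "Arlington"),
  Decisions = c(120, 197, 147, 131, 212, 729),
  Grants    = c(9, 48, 40, 19, 155, 536),
  stringsAsFactors = FALSE
)

# Sum decisions and grants within each court.
by_court <- aggregate(cbind(Decisions, Grants) ~ Court, data = toy, FUN = sum)

# Court-level grant rate: total grants over total decisions.
by_court$Grant.Rate <- round(by_court$Grants / by_court$Decisions, 3)
by_court
```

Applied to the full data frame, the same call would give one row per court, which could then be sorted with order as shown earlier.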
I have a theory that the rate of denials is related to the number of cases the judge has decided. The more cases, the higher the rate of denials. See if you can use R to figure out if the data supports this theory.
There are many ways we can explore this hypothesis. And, indeed, a good chunk of the course in Analytic Methods is devoted to this process. Let’s take a look here, however, at some simple methods.
One simple approach would be to see if there is any correlation between the number of decisions and the denial rate. The key function to accomplish this in R is cor. By default, cor computes something known as the Pearson correlation coefficient. It’s 1 if the data is perfectly positively correlated, zero if there is no linear correlation at all, and -1 if the data is perfectly inversely correlated. You can explore Pearson (and other) measures of correlation by going to this website.
cor(Immigration.Judge.Data[c("Decisions","Percent.Denials")])
## Decisions Percent.Denials
## Decisions 1.0000 -0.5296
## Percent.Denials -0.5296 1.0000
What we see appears to be kind of the opposite of what I predicted. There appears to be a negative correlation between the number of decisions and the denial rate. Maybe judges, instead of getting harsher as they gain more experience, get more soft-hearted. Or maybe there are other factors at work.
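For anyone curious what cor is doing under the hood, the Pearson coefficient can be computed by hand. The sketch below uses only the six rows from the head output, so the number it produces is not the -0.5296 reported for the full data; the variable names are assumptions.

```r
# Decisions and denial percentages for the six judges in the head() output.
decisions <- c(120, 197, 147, 131, 212, 729)
denials   <- c(92.5, 75.6, 72.8, 85.5, 26.9, 26.5)

# Deviations of each variable from its own mean.
dx <- decisions - mean(decisions)
dy <- denials - mean(denials)

# Pearson r: sum of cross-products over the square root of the product of
# the sums of squares. It is always between -1 and 1.
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# The built-in function should agree.
r_builtin <- cor(decisions, denials)
all.equal(r_manual, r_builtin)
```

Because Pearson's coefficient measures only linear association, a value near zero would not rule out a curved relationship; `cor` also accepts `method = "spearman"` for a rank-based alternative.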
We can visualize the correlation using the pairs function in R. Hopefully, the output is clear enough to give you a feel for what is going on.
pairs(Immigration.Judge.Data[c("Decisions","Percent.Denials")])
## Linear regression
We can do a linear regression of the rate of denials on the number of decisions.
summary(lm(data=Immigration.Judge.Data,Percent.Denials~Decisions))
##
## Call:
## lm(formula = Percent.Denials ~ Decisions, data = Immigration.Judge.Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.36 -14.20 -0.52 17.08 43.12
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 71.93358 1.81362 39.7 <2e-16 ***
## Decisions -0.03305 0.00323 -10.2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.9 on 268 degrees of freedom
## Multiple R-squared: 0.28, Adjusted R-squared: 0.278
## F-statistic: 104 on 1 and 268 DF, p-value: <2e-16
I know we haven’t studied linear regression yet, but if you look at the coefficients section of the output and look under “Estimate” for “Decisions,” you will see that it is negative. This means that a larger number of decisions is associated with a lower rate of denials, the opposite of what I thought would be the case. We can also see that R has placed three asterisks in the rightmost column of the coefficients output. Without getting into details here, this means that the result is statistically significant with a great deal of confidence. We can also look, by the way, at the “Adjusted R-squared.” It shows a value of 0.278. This means, very roughly, that one can account for 27.8% of the variation in the denial rate just by looking at the number of decisions.
## Logit and Probit regressions
But a linear regression isn’t really proper where the value we are trying to predict is a percentage. So we can try logit and probit forms of regression, which are more appropriate. To do this, however, I first have to make sure that the percent denials are decimal fractions rather than percentage points. I can then use R’s glm function to run these more advanced forms of regression.
Immigration.Judge.Data$Percent.Denials<-Immigration.Judge.Data$Percent.Denials/100.
print(summary(glm(data=Immigration.Judge.Data,Percent.Denials~Decisions,
family=binomial(link="logit"))))
## Warning: non-integer #successes in a binomial glm!
##
## Call:
## glm(formula = Percent.Denials ~ Decisions, family = binomial(link = "logit"),
## data = Immigration.Judge.Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0713 -0.2855 -0.0262 0.3479 0.9681
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.952721 0.203947 4.67 3e-06 ***
## Decisions -0.001510 0.000403 -3.75 0.00018 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 67.790 on 269 degrees of freedom
## Residual deviance: 50.206 on 268 degrees of freedom
## AIC: 324.5
##
## Number of Fisher Scoring iterations: 4
print(summary(glm(data=Immigration.Judge.Data,Percent.Denials~Decisions,
family=binomial(link="probit"))))
## Warning: non-integer #successes in a binomial glm!
##
## Call:
## glm(formula = Percent.Denials ~ Decisions, family = binomial(link = "probit"),
## data = Immigration.Judge.Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0693 -0.2847 -0.0209 0.3463 0.9633
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.586240 0.122410 4.79 1.7e-06 ***
## Decisions -0.000919 0.000234 -3.92 8.7e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 67.79 on 269 degrees of freedom
## Residual deviance: 50.24 on 268 degrees of freedom
## AIC: 324.6
##
## Number of Fisher Scoring iterations: 4
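The "non-integer #successes" warning appears because a binomial glm given a bare proportion has no idea how many decisions stand behind it. One common way around this, sketched below on the six rounded Houston rows only, is to give glm the counts directly, either as a two-column response or as a proportion with the caseload as weights. This is a different specification from the one run above (each judge is weighted by caseload), so its estimates would not match the printed output, and six rows are for syntax illustration only. All names here are assumptions.

```r
# Rounded Houston counts from earlier in this report.
houston_toy <- data.frame(
  Decisions = c(111, 123, 204, 188, 275, 217),
  Denials   = c(97, 94, 146, 134, 189, 125)
)
houston_toy$Grants <- houston_toy$Decisions - houston_toy$Denials

# Form 1: two-column response of (successes, failures).
fit_counts <- glm(cbind(Denials, Grants) ~ Decisions,
                  data = houston_toy, family = binomial(link = "logit"))

# Form 2: proportion as the response, number of trials as weights.
fit_weighted <- glm(Denials / Decisions ~ Decisions,
                    data = houston_toy, weights = Decisions,
                    family = binomial(link = "logit"))

# Both forms describe the same model, so the coefficients should agree.
all.equal(coef(fit_counts), coef(fit_weighted))
coef(fit_counts)
```

Swapping `link = "logit"` for `link = "probit"` gives the probit version in the same way.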
If you look at the output from these regressions, you will see roughly the same story as we saw with linear regression: the number of decisions is in fact negatively correlated with the denial rate.
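As a rough check on how the three fits differ in practice, the coefficients printed above can be turned into predicted denial rates by hand. The sketch below copies the estimates from the three summaries; the caseloads chosen and the variable names are assumptions. `plogis` and `pnorm` are the inverse logit and inverse probit transformations.

```r
# Coefficients (intercept, slope) copied from the summaries above.
lin_b    <- c(71.93358, -0.03305)    # linear fit, in percentage points
logit_b  <- c(0.952721, -0.001510)   # logit link
probit_b <- c(0.586240, -0.000919)   # probit link

# 2322 is the largest caseload shown in the tables above.
caseload <- c(200, 1000, 2322)

predicted <- data.frame(
  Decisions = caseload,
  # Linear prediction converted from percentage points to a fraction.
  linear = (lin_b[1] + lin_b[2] * caseload) / 100,
  # Inverse logit maps the linear predictor into (0, 1).
  logit  = plogis(logit_b[1] + logit_b[2] * caseload),
  # Inverse probit does the same via the normal distribution function.
  probit = pnorm(probit_b[1] + probit_b[2] * caseload)
)
round(predicted, 3)
```

At the largest caseload the straight line predicts a denial rate below zero, which is impossible; the logit and probit predictions stay between 0 and 1 and remain close to each other, which is the practical reason for preferring them here.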
It certainly appears that geography plays a significant role in the probability that one will receive political asylum from an immigration judge. But in order to assert this proposition with greater confidence, we’d need to look at many more variables, such as the distribution of the nations from which the asylum seekers are coming.
1. The underlying data comes from Syracuse University and its TRAC project. You can find the material from which I originally extracted the data here.
2. In Mathematica, if you just want to see a short form of the expression, you use Short.
3. Obviously, the fix should have been done in Google Spreadsheets itself using its Round command. My failure to do so gives me the opportunity of showing how it can be done using lapply in R.