The Distribution of Citations

Creating the distribution using table

Now I’d like to know how many times each case is cited. The R command for finding this is table.

edgetable<-table(edges[,"cited"])

This creates a table object, which is fine, but I prefer working with data.frame objects, so I will convert it. I add the optional argument stringsAsFactors=FALSE because I don’t need the complications that come from using R Factors. Trust me on this for now.

tally<-as.data.frame(edgetable,stringsAsFactors=FALSE)

Again, I’d like to give the columns in tally some evocative names.

colnames(tally)<-c("caseid","citations")
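To see what table and as.data.frame are doing here, it may help to run the same three steps on a tiny made-up edge list. This is only a sketch: the case ids are hypothetical and the toy_ names are mine, not part of the real data.

```r
# Toy edge list (hypothetical ids): each row says "citing cites cited".
toy_edges <- data.frame(citing = c(3, 4, 4, 5, 5, 5),
                        cited  = c(1, 1, 2, 1, 2, 4))

# Count how often each id shows up in the cited column.
toy_table <- table(toy_edges[, "cited"])

# Convert to a data.frame and give the columns friendlier names.
toy_tally <- as.data.frame(toy_table, stringsAsFactors = FALSE)
colnames(toy_tally) <- c("caseid", "citations")
toy_tally
```

Notice that cases 3 and 5, which cite others but are never cited themselves, do not show up in toy_tally at all. The same holds for tally, since it is built only from the cited column.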

Before we go much further, however, I’d like to fiddle a bit with the data. (Before I realized this, I managed to crash R. Don’t repeat my error.) If you run the following code, you will see that the caseids are stored not as integers, but as characters (strings).

class(tally$caseid)

## [1] "character"

But for statistical analyses, it will definitely be helpful to treat the caseids as numbers. So, we need to convert them.

tally$caseid<-as.integer(tally$caseid)
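Here is a small sketch of the factor complication I asked you to trust me about earlier. The ids are made up. Had caseid been stored as a factor, as.integer would have handed back R’s internal level codes rather than the numbers you see printed.

```r
# Hypothetical ids stored as strings, the way tally$caseid is stored above.
ids_chr <- c("10", "200", "3000")
ids_fac <- factor(ids_chr)

from_chr <- as.integer(ids_chr)                # the numbers themselves
from_fac <- as.integer(ids_fac)                # level codes, not the ids
repaired <- as.integer(as.character(ids_fac))  # the usual workaround
```

Because I asked for stringsAsFactors=FALSE, the plain as.integer call is all that is needed in this document.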

Basic Plotting

I can now make a simple histogram showing the distribution of citation counts.

hist(tally[,"citations"])

[Plot: histogram of citation counts drawn with base R hist]

So, we can immediately see that there are an awful lot of Supreme Court cases that don’t get cited a lot and that the probability of being cited very frequently declines swiftly.

More advanced plotting using the ggplot2 package

The histogram created using R’s built-in functionality is neither particularly detailed nor very pleasing to look at. So, I’m going to load in one of the most popular R packages, ggplot2, in order to make nicer graphics. By the way, for those who like “buying local,” ggplot2 is the work of a Rice University professor, Hadley Wickham. I’ve already installed the package on my computer.

require("ggplot2")

## Loading required package: ggplot2

There are essentially two ways of making plots using ggplot2. One can use the “simple” way using the qplot command or one can construct the plots in a more complicated way using ggplot and some related functionality. I actually think using the ggplot way is often better, so I will show both.

qplot1<-qplot(data=tally,x=citations,geom="histogram",binwidth=10)
qplot1

[Plot: qplot histogram of citations, bin width 10]

So here is how to read the command above. We are going to create a variable named qplot1 that will hold a plot. The variable is created using the qplot{ggplot2} command. (By the way, the notation somecommand{somepackage} is used to denote that somecommand is located in somepackage). I’m going to use the “named arguments” construct of R because I happen not to like the default order of arguments used by qplot. The first argument here contains the source of the data, tally. There is only one column that I need to create the plot, citations, so I assign it to what qplot calls x. What I want to produce is a histogram, so I set geom, which you can kind of think of as the drawing routine, to histogram, and I adjust the width of the bins in the histogram to contain 10 citations.

Now I will do this the fancier way. I use the ggplot (not ggplot2) command. Again, I tell ggplot that the data is located in tally. I now tell ggplot how data contained in tally is to be mapped to aesthetic features of the plot. Here, all I have to tell it is that the only thing to be plotted, citations, is mapped to the “x” feature of whatever plot I happen to use. To help ggplot understand this is such a mapping, I wrap this information up in an aes command. The name of the command is short for aesthetic, although, personally, I don’t find aes a very good term. I would have called it datamapping, but, then, I am not a professional software designer. The ggplot command creates a plot object (not an ordinary data.frame, although it carries the data and the mapping along with it) that I will call gghistdata. Standing by itself like this, ggplot does not draw any bars or points. It just creates a kind of fancy database that can then be used to produce plots. This is probably a good thing and is one of the central ideas of the ggplot2 package.

gghistdata<-ggplot(data=tally,aes(x=citations))

Now I will create the code to produce the visualization. The idea is to create a “layer.” The layer will use whatever data it is presented with to create bars. The bars will be colored “steel blue.” The transformation I will use on the data is to bin it, which is, of course, what one needs to do to produce a histogram. And I will adjust the binning so that each bin has a width of 10. Notice that histolayer has no idea of what data it is going to be used with. One of the ideas is that one can create these layers and mix and match them with various databases created by ggplot.

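# Older ggplot2 releases took geom_params and stat_params; current releases
# take a single params list and want position spelled out.
histolayer<-layer(geom="bar",
                  stat="bin",
                  position="stack",
                  params=list(fill="steelblue",binwidth=10))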

Now time to put it all together. The idea of ggplot is to “overload” the addition operator. So, in the code below I am not really “adding” a database to a function. That would be ridiculous. The plus is used as a kind of evocative shorthand to say that I am combining a database with a drawing algorithm.

gghistdata+histolayer

[Plot: steel-blue ggplot histogram of citations, bin width 10]

There’s also a shorter way of using ggplot. Instead of creating a separate layer explicitly, one can use a function built into ggplot that itself creates a layer. There are many such functions. One of them is geom_histogram, which you might guess creates a layer that, when combined with data, produces a histogram. Here I am going to create a pink histogram (why not?) and have a bin width of 5.

gghistdata+geom_histogram(fill="pink",binwidth=5)

[Plot: pink ggplot histogram of citations, bin width 5]
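The mix-and-match idea is easy to check on toy data. The sketch below uses two made-up data frames (toy_a and toy_b are hypothetical, not the Supreme Court counts) and adds one and the same histogram layer to each. It assumes a reasonably recent ggplot2 is installed.

```r
library(ggplot2)

# Two hypothetical data sets that share a column name.
toy_a <- data.frame(citations = c(1, 2, 2, 15, 40))
toy_b <- data.frame(citations = c(0, 5, 5, 5, 90, 120))

# Data plus mapping only: nothing has been told how to draw yet.
base_a <- ggplot(data = toy_a, aes(x = citations))
base_b <- ggplot(data = toy_b, aes(x = citations))

# One layer, created once, with no idea which data it will meet.
hist_layer <- geom_histogram(binwidth = 10, fill = "steelblue")

# The overloaded plus combines each data object with the same layer.
plot_a <- base_a + hist_layer
plot_b <- base_b + hist_layer

length(base_a$layers)  # layers before adding
length(plot_a$layers)  # layers after adding
```

Typing plot_a or plot_b at the console draws the corresponding histogram.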

Some visual and statistical analysis of the relation between chronology and citation count

Time for our next investigation. Is there any chronological pattern to the number of citations? Do more recent cases get cited more frequently? Or are the older cases cumulatively cited more? Or maybe the Goldilocks principle applies and cases that are kind of in the middle get cited more? Let’s explore this idea using R.

The code below uses qplot to create a scatterplot. I’m going to take every 100th case in tally. (I could take more but R kind of chokes on plotting 30,000 points.) The x value on the scatterplot will be the caseid (a pretty good proxy for time of decision) and the y value on the scatterplot will be the number of citations. By specifying geom="point", I make clear that I want a scatterplot.

qplot2<-qplot(data=tally[seq(1,length(tally$citations),100),],x=caseid,y=citations,geom="point") 
qplot2

[Plot: scatterplot of citations against caseid for every 100th case]
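The row selection inside that qplot call is ordinary base R indexing, and it can be tried on its own. Here is a sketch with a made-up 1,000-row data frame (toy and its values are hypothetical). It also shows a random sample of the same size, which is an alternative I am adding for comparison, not something used elsewhere in this document.

```r
# Hypothetical data: 1,000 cases with made-up citation counts.
toy <- data.frame(caseid = 1:1000,
                  citations = rep(c(0, 3, 12, 40), 250))

# Every 100th row, starting with the first: rows 1, 101, 201, ..., 901.
every100 <- seq(1, nrow(toy), by = 100)
toy_systematic <- toy[every100, ]

# Alternative: a random sample of the same size, kept in caseid order.
set.seed(42)
toy_random <- toy[sort(sample(nrow(toy), length(every100))), ]
```

Systematic sampling keeps the points evenly spread along the x axis, while a random sample avoids accidentally lining up with any periodic pattern in the data.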

The code below shows how I could use ggplot to get a little more control over the plotting process. I can, for example, specify that I want big red stars to mark the points.

gghistdata2<-ggplot(data=tally[seq(1,length(tally$citations),100),],aes(x=caseid,y=citations)) 
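# Current ggplot2 wants stat and position given and a single params list.
scatterlayer<-layer(geom="point",stat="identity",position="identity",
                    params=list(colour="red",size=5,shape=8))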
gghistdata2+scatterlayer

[Plot: the same scatterplot with large red star markers]

I don’t know about you, but I look at this scatterplot and I don’t see much. It may be that the effect is subtle or that by just picking every 100th case in order to make visualization more tractable I lose needed information. I’m now going to show swiftly how R can use some statistical techniques to see if there might be any relationship between chronology and citation.

I can use R’s lm function to do a simple linear regression of citations on caseid.

lm1<-lm(data=tally,formula=citations~caseid)
summary(lm1)

## 
## Call:
## lm(formula = citations ~ caseid, data = tally)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -11.33  -6.62  -3.99   1.89 236.49 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.28e+00   1.77e-01    35.5   <2e-16 ***
## caseid      2.00e-04   1.02e-05    19.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.1 on 23283 degrees of freedom
## Multiple R-squared:  0.0162, Adjusted R-squared:  0.0162 
## F-statistic:  384 on 1 and 23283 DF,  p-value: <2e-16

What we can see from the summary of our computation is that caseid has a very small (but statistically significant) positive effect on the frequency of citation. We can also see, however, from our R-squared values that caseid does not explain much at all of the variation in citations.
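To get a feel for how small that slope is, one can plug the rounded estimates from the summary back into the fitted line. This is just arithmetic on the printed coefficients; predicted_citations is a hypothetical helper name of mine.

```r
# Rounded estimates copied from the summary(lm1) output above.
intercept <- 6.28
slope     <- 2.00e-04

# Fitted number of citations for a given caseid under the linear model.
predicted_citations <- function(caseid) intercept + slope * caseid

# Difference in fitted citations between two cases 10,000 ids apart.
gain_per_10000 <- predicted_citations(20000) - predicted_citations(10000)
gain_per_10000
```

In other words, moving 10,000 cases later in the sequence is associated with only about two additional citations, against a residual standard error of roughly 13.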

Perhaps the problem is that the linear model is misspecified, that really the Goldilocks principle applies and there is a sweet spot. If this were true, our linear model might show little relationship when in fact a significant relationship exists. So, I am going to try a quadratic model. One way to do this is to add to our citations database a new column called caseid2 that is simply the square of caseid. I can then regress citations on caseid and caseid2.

tally$caseid2<-(tally$caseid)^2
lm2<-lm(data=tally,formula=citations~caseid+caseid2)
summary(lm2)

## 
## Call:
## lm(formula = citations ~ caseid + caseid2, data = tally)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -10.29  -7.10  -3.55   1.92 237.51 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.87e+00   2.74e-01    10.4   <2e-16 ***
## caseid       8.41e-04   4.08e-05    20.6   <2e-16 ***
## caseid2     -2.10e-08   1.30e-09   -16.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13 on 23282 degrees of freedom
## Multiple R-squared:  0.0272, Adjusted R-squared:  0.0271 
## F-statistic:  325 on 2 and 23282 DF,  p-value: <2e-16

What I see is that the sign in front of the squared term is negative, which does indeed suggest a bit of a Goldilocks effect. The result is statistically significant but the magnitude of the effect is again small. Even with the quadratic term added, I am accounting for less than 3% of the variation in citation count.
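If there is a sweet spot, the fitted parabola says where it is: a quadratic b0 + b1*x + b2*x^2 with negative b2 peaks at x = -b1/(2*b2). The sketch below does that arithmetic with the rounded coefficients printed above, so the answer is only as precise as those three digits.

```r
# Rounded estimates copied from the summary(lm2) output above.
b0 <- 2.87
b1 <- 8.41e-04
b2 <- -2.10e-08

# Location and height of the peak of the fitted quadratic.
peak_caseid <- -b1 / (2 * b2)
peak_value  <- b0 + b1 * peak_caseid + b2 * peak_caseid^2

round(peak_caseid)
round(peak_value, 2)
```

So the model puts the peak at a caseid of roughly 20,000, with a fitted count of a little over 11 citations.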

I can use our friend ggplot to visualize the result of the regression.

ggplot(data=data.frame(x=c(0,30000)),aes(x=x))+
  stat_function(fun=(function(x){2.867+0.00084*x-2.099e-08*x^2}),geom="line")

[Plot: fitted quadratic curve of citations against caseid]
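The curve above is drawn from coefficients that I typed in by hand. A possible alternative, sketched here on simulated data (toy and its coefficients are made up for illustration), is to write the squared term inside the formula with I() and let predict compute the fitted values. The sketch also checks that this gives the same fit as the extra-column approach.

```r
set.seed(1)  # simulated data, for illustration only
toy <- data.frame(caseid = 1:200)
toy$citations <- 3 + 0.05 * toy$caseid - 0.0002 * toy$caseid^2 + rnorm(200)

# Approach used in the text: add a squared column, then regress on both.
toy$caseid2 <- toy$caseid^2
fit_column <- lm(citations ~ caseid + caseid2, data = toy)

# Alternative: square inside the formula, protected by I().
fit_inline <- lm(citations ~ caseid + I(caseid^2), data = toy)
same_fit <- all.equal(unname(coef(fit_column)), unname(coef(fit_inline)))

# Fitted curve straight from the model, no hand-copied numbers.
grid <- data.frame(caseid = seq(1, 200, by = 1))
grid$fitted <- predict(fit_inline, newdata = grid)
```

With the I() form, the grid only needs a caseid column, and grid could then be handed to ggplot with a line geom to draw the curve.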

Network Analysis

Finding the most cited cases

I now want to find the most frequently cited cases. To do this, I use the order command in R. Note the use of a negative sign in front of the citations argument in the order command. This tells R I want the data ordered from highest citation count to lowest.

tallyS<-tally[with(tally, order(-citations)), ] 
top100<-tallyS[1:100,"caseid"]
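Here is the same sort-and-take-the-top idiom on a four-row toy table (the ids and counts are made up), together with two variations that I find a little easier to read. Treat it as a sketch, not a replacement for the code above.

```r
# Hypothetical tally with four cases.
toy_tally <- data.frame(caseid    = c(11L, 12L, 13L, 14L),
                        citations = c(5L, 40L, 1L, 17L))

# Negating the counts makes order() sort from highest to lowest.
sorted_neg <- toy_tally[order(-toy_tally$citations), ]

# The same ordering, requested explicitly.
sorted_dec <- toy_tally[order(toy_tally$citations, decreasing = TRUE), ]

# Top two ids; head() never returns more rows than exist.
top2 <- head(sorted_neg, 2)$caseid
top2
```

One practical difference: indexing with 1:n returns rows of NA when n is larger than the table, whereas head simply stops at the last row.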

I now want the subgraph of citations from one case in the top 100 to another case in the top 100. To do this I set up conditions on the rows of edges that I want to include. The conditions make use of the %in% operator that determines membership in a set and the & logical operator that says that both conditions must be true before the item will be selected. I want all columns of the edges data frame.

edges100<-edges[(edges$citing %in% top100) & (edges$cited %in% top100),]

To do the actual network analysis, I will make use of the igraph package. I load it up and then convert my edges100 data.frame into the special graph object that igraph likes to use.

require("igraph") # get the network analysis package

## Loading required package: igraph

g<-graph.data.frame(edges100)

Generalizing the process

Before I go further, I’d like to generalize what I just did. I want to write a function that takes some data.frame e of edges, the tally data.frame of citation counts, and n, an integer containing the number of cases I want to include.

makeg<-function (e,tally,n) {
  sorted<-tally[with(tally,order(-citations)), ] 
  top<-sorted[1:n,"caseid"]
  graph.data.frame(e[(e$citing %in% top) & (e$cited %in% top),])
}

I can now run this function to find, for example, the intracites between the top 25 cases or the intracites between the top 100 cases.

gtop25<-makeg(edges,tally,25)
gtop100<-makeg(edges,tally,100)
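One thing worth checking about graphs built this way: igraph takes its vertices from the edge list, so a top case that neither cites nor is cited by another top case will not appear in the graph at all. The sketch below uses a made-up edge list and a hypothetical set of top ids, and uses graph_from_data_frame, the newer name for graph.data.frame, with its vertices argument to keep such isolated cases.

```r
library(igraph)

# Hypothetical edges and a hypothetical set of "top" case ids.
toy_edges <- data.frame(citing = c(3, 4, 4, 5, 5, 5),
                        cited  = c(1, 1, 2, 1, 2, 4))
top <- c(1, 2, 4, 9)  # case 9 has no links to the other top cases

# Keep only edges whose two ends are both in the top set.
kept <- toy_edges[(toy_edges$citing %in% top) & (toy_edges$cited %in% top), ]

# Vertices inferred from the edges only: case 9 silently disappears.
g_edges_only <- graph_from_data_frame(kept)

# Vertices supplied explicitly: case 9 stays as an isolated node.
g_all <- graph_from_data_frame(kept, vertices = data.frame(name = top))

vcount(g_edges_only)
vcount(g_all)
ecount(g_all)
```

Comparing vcount(gtop100) with 100 would show how many of the top cases drop out of the real graph for this reason.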

Back to visualization

Let’s now try to visualize the network created by intracites among the top 100 cases. We have lots of choices as to the embedding (layout) of the graph. I’m going to use the “Sugiyama” layout, which is supposed to be well suited for networks that have a chronological structure to them.

l<-layout.sugiyama(gtop100)

I can now plot the network.

plot(gtop100,layout=l$layout)

[Plot: Sugiyama layout of the citation network among the top 100 cases]

At the moment, I do not find this particularly revealing. I am working on seeing if there is either a better embedding or if there is some error in the data.

The Network of Supreme Court Intracites

Introduction

Reading in the data