Introduction to R

This file is online!

https://gist.github.com/alexstorer/5460479

R is an interpreted language like Python

"hello world"

## [1] "hello world"

If you use NumPy/SciPy, here is a conversion to R: http://mathesaurus.sourceforge.net/matlab-python-xref.pdf

R is better than Python

The language is designed to make statistics and data handling easy

mtcars["Honda Civic", "mpg"]

## [1] 30.4

lm(mpg ~ cyl + gear + wt, data = mtcars)

## 
## Call:
## lm(formula = mpg ~ cyl + gear + wt, data = mtcars)
## 
## Coefficients:
## (Intercept)          cyl         gear           wt  
##      42.386       -1.528       -0.523       -3.392

Math, Statistics and Matrices work out of the box

2 + 2

## [1] 4

log(2^(0.25))

## [1] 0.1733

Install add ons from inside R

# install.packages('ggplot2')
library(ggplot2)

Superb graphing tools

ggplot(data = mtcars, aes(x = mpg, y = wt, color = cyl)) + geom_point()

plot of chunk unnamed-chunk-5

Incredible IDE: RStudio

You can even run RStudio on the internet, and set up your own FREE Amazon instance to run it:

http://www.louisaslett.com/RStudio_AMI/

CRAN Task Views (http://cran.r-project.org/web/views/) summarize 3rd party packages that do many types of anaylsis

Python is better than R

Installed by default

Syntax is much easier to learn and use

Powerful and intuitive data structures

Outstanding text processing

RPy2 lets you use R from Python

In practice, I use both!

Today's Goal: Getting Started With R

Today's Strategy

Jump in with both feet, worry about the details later! More time spent doing.

Let's get data into R

df <- read.csv("http://files.figshare.com/983501/nsfastgrantbib20121130.csv")

This is the sort of thing designed to just work in R
Unlike any other programming language, the <- symbol does assignment
Our data is now in a data frame, the best data structure in R
- Data Frames are like a matrix for data, but the rows and columns have names
- Different data types can be in the same row (same observation)
- We pull out data using the dollar sign $
- We can also treat our data frame just like a matrix, and use square brackets ([])
R has a special data type for missing data, called NA

Data Frame examples

names(df)

##  [1] "GrantPaperID"             "NSFOrg"                  
##  [3] "GrantNumber"              "GrantTitle"              
##  [5] "AwardedUSD"               "ARRAAmount"              
##  [7] "GrantLink"                "PaperTitle"              
##  [9] "PaperJournal"             "PaperBibcode"            
## [11] "PaperLink"                "Refereed"                
## [13] "CitationCount"            "NSFPrograms"             
## [15] "GrantStartMonth"          "GrantStartYear"          
## [17] "PaperPubDate"             "PIFirstName"             
## [19] "PILastName"               "PaperAuthor"             
## [21] "PIMatch"                  "GranteeOrganization"     
## [23] "GranteeOrganizationState" "GranteeOrganizationZip"  
## [25] "NSFDirectorate"           "PaperAuthorAffiliation"  
## [27] "PaperAuthorCountry"       "PaperKeywords"

df$PaperTitle[100]

## [1] Dust, Atomic, and Molecular Gas in the Nearest Primitive Environment
## 20200 Levels:  ...

df[100, "PaperTitle"]

## [1] Dust, Atomic, and Molecular Gas in the Nearest Primitive Environment
## 20200 Levels:  ...

head(df$PaperTitle)

## [1] 0 papers                                                                                                                                             
## [2] 0 papers                                                                                                                                             
## [3] Spitzer Evidence for a Late-heavy Bombardment and the Formation of Ureilites in η Corvi at ~1 Gyr                                                    
## [4] Spitzer Evidence for a Late Heavy Bombardment and the Formation of Urelites in Î· Corvi at ~1 Gyr                                                    
## [5] Mid-infrared spectra of differentiated meteorites (achondrites): Comparison with astronomical observations of dust in protoplanetary and debris disks
## [6] Spitzer Observations of η Corvi: Evidence at ~1 Gyr for an LHB-Like Delivery of Organics and Water-Rich Material to the THZ of a Sun-Like Star       
## 20200 Levels:  ...

Returning subsets of data frames

df[5:10, c("PaperTitle", "GrantTitle")]

##                                                                                                                                               PaperTitle
## 5  Mid-infrared spectra of differentiated meteorites (achondrites): Comparison with astronomical observations of dust in protoplanetary and debris disks
## 6         Spitzer Observations of η Corvi: Evidence at ~1 Gyr for an LHB-Like Delivery of Organics and Water-Rich Material to the THZ of a Sun-Like Star
## 7                                                                                                                            Ionization of Infalling Gas
## 8                                                                         Sodium atoms in the lunar exotail: Observed velocity and spatial distributions
## 9                                                                                   Measurement of a Magnetic Field in a Leading Arm High-velocity Cloud
## 10                                                                    Vertical Structure of a Supernova-driven Turbulent, Magnetized Interstellar Medium
##                                                                                        GrantTitle
## 5                            Infrared Studies of Circumstellar Dust in Planet-Forming Exo-Systems
## 6                            Infrared Studies of Circumstellar Dust in Planet-Forming Exo-Systems
## 7  Ionized Gas in the Galaxy: A Southern Hemisphere Perspective with the Wisconsin H-Alpha Mapper
## 8  Ionized Gas in the Galaxy: A Southern Hemisphere Perspective with the Wisconsin H-Alpha Mapper
## 9  Ionized Gas in the Galaxy: A Southern Hemisphere Perspective with the Wisconsin H-Alpha Mapper
## 10 Ionized Gas in the Galaxy: A Southern Hemisphere Perspective with the Wisconsin H-Alpha Mapper

The c command, for combine, combines variables into arrays. The : operator returns a vector of values

df[seq(5, 10, 2), c("GrantStartYear", "GrantStartMonth")]

##   GrantStartYear GrantStartMonth
## 5           2009               9
## 7           2006               9
## 9           2006               9

The seq command is a more flexible version of :

A note on this data

Each row is a Grant-Paper combination
- One paper can be on multiple rows
- One grant can be on multiple rows
- Was probably merged from two separated tables

How many papers have more than 50 citations?

head(df$CitationCount > 50)

## [1]    NA    NA FALSE FALSE FALSE FALSE

table(df$CitationCount > 50)

## 
## FALSE  TRUE 
## 20884  4827

head(which(df$CitationCount > 50))

## [1]  89  90  98 135 136 140

paste("There are", length(which(df$CitationCount > 50)), "papers")

## [1] "There are 4827 papers"

head returns the first few elements of what you give it
table summarizes a list by its unique values
which gives the TRUE indices in a logical array
length gives the length of any array
paste concatenates strings

But these are not unique!

numPapers <- length(unique(df[df$CitationCount > 50, "PaperTitle"]))

[df$CitationCount>50,"PaperTitle"] - Just the titles
unique removes any repeats
length tells us how many there are

How many different grants are in our data set?

# Try this one yourself!

The duplicated function

Like unique but gives us the indices (logical or otherwise).

table(duplicated(df$GrantNumber))

## 
## FALSE  TRUE 
##  3432 26075

What is the mean of the money awarded for papers with at least 50 Citations? Less than 10 citations?

popular.papers <- subset(df, CitationCount > 50)
less.popular.papers <- subset(df, CitationCount < 10)
mpp <- mean(popular.papers$AwardedUSD)
lpp <- mean(less.popular.papers$AwardedUSD)
paste(">50:", mpp, "<10", lpp)

## [1] ">50: 1184260.64926455 <10 1482004.83885319"

subset returns a data frame with only the matching elements
mean takes the mean of your data

Make a table of award amounts by presidential administration

df$AwardAmountGroups <- cut(df$AwardedUSD, 3, labels = c("low", "med", "hi"))
df$Administration <- cut(df$GrantStartYear, breaks = c(1992, 2000, 2008, 2013), 
    labels = c("Clinton", "Bush", "Obama"))

with(df, table(Administration, AwardAmountGroups))

##               AwardAmountGroups
## Administration   low   med    hi
##        Clinton  9318    52     0
##        Bush    16348     3     2
##        Obama    3784     0     0

$ and <- can be used to assign a new column to a data frame
cut splits your data into groupings by value, either into N groups, or as directed

Exercises

Do grants start more frequently at a given time of year?
What proportion of papers are referreed?
What state received the most grants?
Find the most cited paper. How many other papers were published on that grant?
How many papers refer to at least three grants?

Solutions

Grant Start Month

df.grants <- df[!duplicated(df$GrantNumber), ]
table(df.grants$GrantStartMonth)

## 
##   1   2   3   4   5   6   7   8   9  10  11  12 
##  76  53 110 115 187 333 723 707 870 193  33  32

plot(table(df.grants$GrantStartMonth))

plot of chunk unnamed-chunk-16

Refereed Papers

df.papers <- df[!duplicated(df$PaperTitle), ]
table(df.papers$Refereed)

## 
##     0 papers non-refereed     refereed 
##            1         5441        14758

States with grants

df.grants <- df[!duplicated(df$GrantNumber), ]
sort(table(df.grants$GranteeOrganizationState))

## 
##  AK  ND  ME  NE  PR      MT  RI  AR  OR  WV  DE  NV  VT  UT  KS  AL  IA 
##   1   1   2   3   3   4   4   4   5   5   5   6   6   6   9  10  15  15 
##  LA  OK  WY  NH  KY  SC  MO  TN  GA  NC  IN  MN  NM  WA  CT  CO  WI  FL 
##  16  16  17  21  22  25  26  34  35  42  43  43  58  58  63  86  94  95 
##  MD  MI  NJ  OH  HI  VA  IL  DC  PA  TX  AZ  MA  NY  CA 
##  99 102 102 106 113 115 120 144 151 156 203 262 264 597

Most cited

max.citations <- max(df$CitationCount)
max.citations <- max(df$CitationCount, na.rm = T)
max.index <- which(df$CitationCount == max.citations)
max.grants <- df[max.index, "GrantNumber"]
# Note: this paper was funded by two grants!
max.subset <- subset(df, GrantNumber == max.grants[1] | GrantNumber == max.grants[2])
length(unique(max.subset$PaperTitle))

## [1] 11



# Here is a more elegant way:
max.grants <- df[which.max(df$CitationCount), "GrantNumber"]
max.subset <- subset(df, GrantNumber %in% max.grants)
length(unique(max.subset$PaperTitle))

## [1] 5

Three grants or more

papers <- aggregate(df$GrantNumber, list(df$PaperTitle), length)
length(which(papers$x > 3))

## [1] 589

Our goal!

Plot of Grants

What do we need in our data frame?

x-axis - Year of Paper y-axis - Years Since Grant color - Number of Papers

levels(df$PaperPubDate)

##   [1] ""        "00/1979" "00/1996" "00/1997" "00/1998" "00/1999" "00/2000"
##   [8] "00/2001" "00/2002" "00/2003" "00/2004" "00/2005" "00/2006" "00/2007"
##  [15] "00/2008" "00/2009" "00/2010" "00/2011" "00/2012" "00/2013" "1-Apr"  
##  [22] "1-Aug"   "1-Dec"   "1-Feb"   "1-Jan"   "1-Jul"   "1-Jun"   "1-Mar"  
##  [29] "1-May"   "1-Nov"   "1-Oct"   "1-Sep"   "10-Apr"  "10-Aug"  "10-Dec" 
##  [36] "10-Feb"  "10-Jan"  "10-Jul"  "10-Jun"  "10-Mar"  "10-May"  "10-Nov" 
##  [43] "10-Oct"  "10-Sep"  "11-Apr"  "11-Aug"  "11-Dec"  "11-Feb"  "11-Jan" 
##  [50] "11-Jul"  "11-Jun"  "11-Mar"  "11-May"  "11-Nov"  "11-Oct"  "11-Sep" 
##  [57] "12-Apr"  "12-Aug"  "12-Dec"  "12-Feb"  "12-Jan"  "12-Jul"  "12-Jun" 
##  [64] "12-Mar"  "12-May"  "12-Nov"  "12-Oct"  "12-Sep"  "13-Jan"  "2-Apr"  
##  [71] "2-Aug"   "2-Dec"   "2-Feb"   "2-Jan"   "2-Jul"   "2-Jun"   "2-Mar"  
##  [78] "2-May"   "2-Nov"   "2-Oct"   "2-Sep"   "2012"    "3-Apr"   "3-Aug"  
##  [85] "3-Dec"   "3-Feb"   "3-Jan"   "3-Jul"   "3-Jun"   "3-Mar"   "3-May"  
##  [92] "3-Nov"   "3-Oct"   "3-Sep"   "4-Apr"   "4-Aug"   "4-Dec"   "4-Feb"  
##  [99] "4-Jan"   "4-Jul"   "4-Jun"   "4-Mar"   "4-May"   "4-Nov"   "4-Oct"  
## [106] "4-Sep"   "5-Apr"   "5-Aug"   "5-Dec"   "5-Feb"   "5-Jan"   "5-Jul"  
## [113] "5-Jun"   "5-Mar"   "5-May"   "5-Nov"   "5-Oct"   "5-Sep"   "6-Apr"  
## [120] "6-Aug"   "6-Dec"   "6-Feb"   "6-Jan"   "6-Jul"   "6-Jun"   "6-Mar"  
## [127] "6-May"   "6-Nov"   "6-Oct"   "6-Sep"   "7-Apr"   "7-Aug"   "7-Dec"  
## [134] "7-Feb"   "7-Jan"   "7-Jul"   "7-Jun"   "7-Mar"   "7-May"   "7-Nov"  
## [141] "7-Oct"   "7-Sep"   "8-Apr"   "8-Aug"   "8-Dec"   "8-Feb"   "8-Jan"  
## [148] "8-Jul"   "8-Jun"   "8-Mar"   "8-May"   "8-Nov"   "8-Oct"   "8-Sep"  
## [155] "9-Apr"   "9-Aug"   "9-Dec"   "9-Feb"   "9-Jan"   "9-Jul"   "9-Jun"  
## [162] "9-Mar"   "9-May"   "9-Nov"   "9-Oct"   "9-Sep"   "Apr-00"  "Apr-79" 
## [169] "Apr-91"  "Apr-94"  "Apr-95"  "Apr-96"  "Apr-97"  "Apr-98"  "Apr-99" 
## [176] "Aug-00"  "Aug-79"  "Aug-81"  "Aug-86"  "Aug-88"  "Aug-91"  "Aug-95" 
## [183] "Aug-96"  "Aug-97"  "Aug-98"  "Aug-99"  "Dec-00"  "Dec-82"  "Dec-87" 
## [190] "Dec-90"  "Dec-95"  "Dec-96"  "Dec-97"  "Dec-98"  "Dec-99"  "Feb-00" 
## [197] "Feb-79"  "Feb-82"  "Feb-83"  "Feb-89"  "Feb-96"  "Feb-97"  "Feb-98" 
## [204] "Feb-99"  "Jan-00"  "Jan-78"  "Jan-81"  "Jan-83"  "Jan-93"  "Jan-96" 
## [211] "Jan-97"  "Jan-98"  "Jan-99"  "Jul-00"  "Jul-80"  "Jul-83"  "Jul-87" 
## [218] "Jul-88"  "Jul-94"  "Jul-95"  "Jul-96"  "Jul-97"  "Jul-98"  "Jul-99" 
## [225] "Jun-00"  "Jun-01"  "Jun-89"  "Jun-95"  "Jun-96"  "Jun-97"  "Jun-98" 
## [232] "Jun-99"  "Mar-00"  "Mar-77"  "Mar-89"  "Mar-96"  "Mar-97"  "Mar-98" 
## [239] "Mar-99"  "May-00"  "May-81"  "May-85"  "May-93"  "May-96"  "May-97" 
## [246] "May-98"  "May-99"  "Nov-00"  "Nov-95"  "Nov-96"  "Nov-97"  "Nov-98" 
## [253] "Nov-99"  "Oct-00"  "Oct-95"  "Oct-96"  "Oct-97"  "Oct-98"  "Oct-99" 
## [260] "Sep-00"  "Sep-86"  "Sep-95"  "Sep-96"  "Sep-97"  "Sep-98"  "Sep-99"

We have to use regular expressions here

yearstr <- paste(df$PaperPubDate)
unique(yearstr)

##   [1] ""        "12-Mar"  "12-Feb"  "12-May"  "11-Mar"  "9-Jan"   "12-Jun" 
##   [8] "10-Dec"  "12-Jul"  "8-Oct"   "8-Nov"   "9-Jul"   "9-Oct"   "12-Nov" 
##  [15] "10-Oct"  "12-Sep"  "11-Jan"  "10-Jan"  "12-Jan"  "11-Nov"  "10-Aug" 
##  [22] "10-Feb"  "12-Apr"  "10-Jun"  "11-Oct"  "11-Jul"  "10-Sep"  "11-Dec" 
##  [29] "11-Feb"  "11-Sep"  "12-Oct"  "00/2012" "11-Jun"  "11-May"  "9-Mar"  
##  [36] "4-Apr"   "8-Jun"   "3-Nov"   "4-Aug"   "6-Feb"   "8-Mar"   "5-Jul"  
##  [43] "4-Dec"   "2-Dec"   "5-May"   "3-Apr"   "5-Sep"   "6-Dec"   "7-Jul"  
##  [50] "5-Jun"   "5-Oct"   "4-Mar"   "10-May"  "9-Nov"   "9-Dec"   "9-Feb"  
##  [57] "7-Jun"   "9-Jun"   "9-Apr"   "7-Mar"   "7-Oct"   "7-Feb"   "6-Jun"  
##  [64] "5-Mar"   "7-Jan"   "6-Oct"   "8-Sep"   "8-Feb"   "7-Dec"   "5-Dec"  
##  [71] "5-Aug"   "9-Sep"   "3-Mar"   "7-Apr"   "4-Nov"   "3-May"   "2-Sep"  
##  [78] "11-Aug"  "12-Aug"  "10-Nov"  "11-Apr"  "10-Jul"  "8-Jan"   "10-Apr" 
##  [85] "7-Sep"   "5-Jan"   "7-Nov"   "6-Aug"   "9-May"   "4-Jun"   "00/2006"
##  [92] "5-Nov"   "9-Aug"   "8-Apr"   "7-Aug"   "8-Jul"   "10-Mar"  "7-May"  
##  [99] "4-May"   "6-Nov"   "6-Jul"   "3-Dec"   "4-Jul"   "6-May"   "5-Feb"  
## [106] "4-Jan"   "12-Dec"  "8-May"   "6-Sep"   "6-Mar"   "6-Apr"   "Jul-00" 
## [113] "1-Jun"   "1-Oct"   "8-Dec"   "00/2009" "6-Jan"   "00/2007" "4-Oct"  
## [120] "3-Oct"   "5-Apr"   "3-Aug"   "3-Jun"   "4-Sep"   "00/2003" "00/2011"
## [127] "1-Jul"   "00/2002" "2-Mar"   "3-Feb"   "2-May"   "2-Apr"   "2-Jun"  
## [134] "1-Nov"   "2-Oct"   "1-Dec"   "8-Aug"   "00/2008" "Jan-93"  "3-Sep"  
## [141] "2012"    "2-Nov"   "4-Feb"   "Dec-99"  "Feb-89"  "Aug-81"  "00/2004"
## [148] "Jul-83"  "00/2005" "3-Jan"   "3-Jul"   "00/2010" "2-Feb"   "2-Jul"  
## [155] "2-Aug"   "May-85"  "Jul-87"  "2-Jan"   "Jul-97"  "Aug-79"  "00/1979"
## [162] "Apr-79"  "Jul-80"  "May-81"  "Mar-89"  "Feb-82"  "Dec-87"  "Dec-90" 
## [169] "Aug-88"  "Oct-00"  "Sep-97"  "Aug-96"  "1-Apr"   "00/2001" "1-Sep"  
## [176] "Dec-00"  "1-May"   "Sep-00"  "Jan-00"  "Feb-00"  "Sep-99"  "Nov-99" 
## [183] "Mar-77"  "Aug-91"  "13-Jan"  "Nov-00"  "00/2013" "1-Mar"   "1-Jan"  
## [190] "1-Feb"   "Jan-99"  "Jan-78"  "Feb-83"  "Jan-81"  "Jan-83"  "1-Aug"  
## [197] "Apr-91"  "Mar-96"  "Aug-95"  "Feb-79"  "Mar-00"  "Sep-86"  "Jul-96" 
## [204] "Jul-88"  "Feb-99"  "May-93"  "Apr-94"  "Aug-86"  "00/2000" "Apr-99" 
## [211] "Aug-00"  "Dec-82"  "Jul-94"  "May-00"  "Jun-89"  "Jun-00"  "May-99" 
## [218] "Aug-99"  "Apr-97"  "00/1998" "00/1999" "Aug-97"  "May-97"  "Nov-97" 
## [225] "Jul-99"  "Mar-97"  "Mar-98"  "Apr-00"  "Jun-99"  "Aug-98"  "May-98" 
## [232] "Dec-97"  "Nov-98"  "Jun-98"  "Oct-99"  "Dec-96"  "00/1996" "Jul-98" 
## [239] "Apr-98"  "Jan-98"  "Oct-98"  "Mar-99"  "00/1997" "Feb-98"  "Sep-98" 
## [246] "Dec-98"  "Oct-97"  "Jan-97"  "Oct-96"  "Jun-97"  "Jun-01"  "Nov-96" 
## [253] "Sep-96"  "Jun-96"  "Feb-97"  "Dec-95"  "Nov-95"  "May-96"  "Oct-95" 
## [260] "Apr-96"  "Jan-96"  "Sep-95"  "Jun-95"  "Feb-96"  "Jul-95"  "Apr-95"

yearstr <- gsub(".*/", "", yearstr, perl = TRUE)
unique(yearstr)

##   [1] ""       "12-Mar" "12-Feb" "12-May" "11-Mar" "9-Jan"  "12-Jun"
##   [8] "10-Dec" "12-Jul" "8-Oct"  "8-Nov"  "9-Jul"  "9-Oct"  "12-Nov"
##  [15] "10-Oct" "12-Sep" "11-Jan" "10-Jan" "12-Jan" "11-Nov" "10-Aug"
##  [22] "10-Feb" "12-Apr" "10-Jun" "11-Oct" "11-Jul" "10-Sep" "11-Dec"
##  [29] "11-Feb" "11-Sep" "12-Oct" "2012"   "11-Jun" "11-May" "9-Mar" 
##  [36] "4-Apr"  "8-Jun"  "3-Nov"  "4-Aug"  "6-Feb"  "8-Mar"  "5-Jul" 
##  [43] "4-Dec"  "2-Dec"  "5-May"  "3-Apr"  "5-Sep"  "6-Dec"  "7-Jul" 
##  [50] "5-Jun"  "5-Oct"  "4-Mar"  "10-May" "9-Nov"  "9-Dec"  "9-Feb" 
##  [57] "7-Jun"  "9-Jun"  "9-Apr"  "7-Mar"  "7-Oct"  "7-Feb"  "6-Jun" 
##  [64] "5-Mar"  "7-Jan"  "6-Oct"  "8-Sep"  "8-Feb"  "7-Dec"  "5-Dec" 
##  [71] "5-Aug"  "9-Sep"  "3-Mar"  "7-Apr"  "4-Nov"  "3-May"  "2-Sep" 
##  [78] "11-Aug" "12-Aug" "10-Nov" "11-Apr" "10-Jul" "8-Jan"  "10-Apr"
##  [85] "7-Sep"  "5-Jan"  "7-Nov"  "6-Aug"  "9-May"  "4-Jun"  "2006"  
##  [92] "5-Nov"  "9-Aug"  "8-Apr"  "7-Aug"  "8-Jul"  "10-Mar" "7-May" 
##  [99] "4-May"  "6-Nov"  "6-Jul"  "3-Dec"  "4-Jul"  "6-May"  "5-Feb" 
## [106] "4-Jan"  "12-Dec" "8-May"  "6-Sep"  "6-Mar"  "6-Apr"  "Jul-00"
## [113] "1-Jun"  "1-Oct"  "8-Dec"  "2009"   "6-Jan"  "2007"   "4-Oct" 
## [120] "3-Oct"  "5-Apr"  "3-Aug"  "3-Jun"  "4-Sep"  "2003"   "2011"  
## [127] "1-Jul"  "2002"   "2-Mar"  "3-Feb"  "2-May"  "2-Apr"  "2-Jun" 
## [134] "1-Nov"  "2-Oct"  "1-Dec"  "8-Aug"  "2008"   "Jan-93" "3-Sep" 
## [141] "2-Nov"  "4-Feb"  "Dec-99" "Feb-89" "Aug-81" "2004"   "Jul-83"
## [148] "2005"   "3-Jan"  "3-Jul"  "2010"   "2-Feb"  "2-Jul"  "2-Aug" 
## [155] "May-85" "Jul-87" "2-Jan"  "Jul-97" "Aug-79" "1979"   "Apr-79"
## [162] "Jul-80" "May-81" "Mar-89" "Feb-82" "Dec-87" "Dec-90" "Aug-88"
## [169] "Oct-00" "Sep-97" "Aug-96" "1-Apr"  "2001"   "1-Sep"  "Dec-00"
## [176] "1-May"  "Sep-00" "Jan-00" "Feb-00" "Sep-99" "Nov-99" "Mar-77"
## [183] "Aug-91" "13-Jan" "Nov-00" "2013"   "1-Mar"  "1-Jan"  "1-Feb" 
## [190] "Jan-99" "Jan-78" "Feb-83" "Jan-81" "Jan-83" "1-Aug"  "Apr-91"
## [197] "Mar-96" "Aug-95" "Feb-79" "Mar-00" "Sep-86" "Jul-96" "Jul-88"
## [204] "Feb-99" "May-93" "Apr-94" "Aug-86" "2000"   "Apr-99" "Aug-00"
## [211] "Dec-82" "Jul-94" "May-00" "Jun-89" "Jun-00" "May-99" "Aug-99"
## [218] "Apr-97" "1998"   "1999"   "Aug-97" "May-97" "Nov-97" "Jul-99"
## [225] "Mar-97" "Mar-98" "Apr-00" "Jun-99" "Aug-98" "May-98" "Dec-97"
## [232] "Nov-98" "Jun-98" "Oct-99" "Dec-96" "1996"   "Jul-98" "Apr-98"
## [239] "Jan-98" "Oct-98" "Mar-99" "1997"   "Feb-98" "Sep-98" "Dec-98"
## [246] "Oct-97" "Jan-97" "Oct-96" "Jun-97" "Jun-01" "Nov-96" "Sep-96"
## [253] "Jun-96" "Feb-97" "Dec-95" "Nov-95" "May-96" "Oct-95" "Apr-96"
## [260] "Jan-96" "Sep-95" "Jun-95" "Feb-96" "Jul-95" "Apr-95"

yearstr <- gsub("\\D*", "", yearstr, perl = TRUE)
unique(yearstr)

##  [1] ""     "12"   "11"   "9"    "10"   "8"    "2012" "4"    "3"    "6"   
## [11] "5"    "2"    "7"    "2006" "00"   "1"    "2009" "2007" "2003" "2011"
## [21] "2002" "2008" "93"   "99"   "89"   "81"   "2004" "83"   "2005" "2010"
## [31] "85"   "87"   "97"   "79"   "1979" "80"   "82"   "90"   "88"   "96"  
## [41] "2001" "77"   "91"   "13"   "2013" "78"   "95"   "86"   "94"   "2000"
## [51] "1998" "1999" "98"   "1996" "1997" "01"

changeinds <- which(as.numeric(yearstr) < 14)
changevals <- as.numeric(yearstr[changeinds]) + 2000
yearstr[changeinds] <- as.character(changevals)
unique(yearstr)

##  [1] ""     "2012" "2011" "2009" "2010" "2008" "2004" "2003" "2006" "2005"
## [11] "2002" "2007" "2000" "2001" "93"   "99"   "89"   "81"   "83"   "85"  
## [21] "87"   "97"   "79"   "1979" "80"   "82"   "90"   "88"   "96"   "77"  
## [31] "91"   "2013" "78"   "95"   "86"   "94"   "1998" "1999" "98"   "1996"
## [41] "1997"

changeinds <- which(unlist(lapply(yearstr, nchar)) == 2)
changevals <- paste("19", yearstr[changeinds], sep = "")
yearstr[changeinds] <- changevals
df$PaperPubDateNumeric <- as.numeric(yearstr)

Now make a new variable for years since grant

df$YearsSinceGrant <- df$PaperPubDateNumeric - df$GrantStartYear

Remove bad rows

Rows with no papers, papers published prior to grant years (?!) and unrefereed papers.

df.filt <- subset(df, YearsSinceGrant >= 0)
df.filt <- subset(df.filt, Refereed == "refereed")

Basic Plots

ggplot(data = df.filt, aes(x = YearsSinceGrant)) + geom_bar()

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust
## this.

plot of chunk unnamed-chunk-25

ggplot maps various aesthetics to components of the data. It then draws things based on these aesthetics, and any transformations.

ggplot(data = df.filt, aes(x = YearsSinceGrant, y = CitationCount)) + geom_point()

## Warning: Removed 60 rows containing missing values (geom_point).

plot of chunk unnamed-chunk-26

library(scales)
ggplot(data = df.filt, aes(x = YearsSinceGrant, y = CitationCount)) + geom_point() + 
    scale_y_log10()

## Warning: Removed 60 rows containing missing values (geom_point).

plot of chunk unnamed-chunk-27

Final Plot

d <- ggplot(df.filt, aes(x = GrantStartYear, y = YearsSinceGrant)) + xlim(1995, 
    2011) + ylim(0, 16)
d <- d + geom_bin2d(binwidth = c(1, 1))
d

plot of chunk unnamed-chunk-28

Change the color

d <- d + scale_fill_gradient(low = "paleturquoise1", high = "midnightblue", 
    name = "# Papers")
d

plot of chunk unnamed-chunk-29

Change the theme

d <- d + theme_bw()
d <- d + theme(text = element_text(size = 24))
d <- d + theme(panel.grid = element_blank())
d

plot of chunk unnamed-chunk-30

Add a line

d <- d + geom_abline(intercept = 2012, slope = -1, size = 2, linetype = "dashed")
d

plot of chunk unnamed-chunk-31

Add a smoother

d <- d + geom_smooth(color = "red")
d

## geom_smooth: method="auto" and size of largest group is >=1000, so using
## gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the
## smoothing method.

## Warning: Removed 18 rows containing missing values (stat_smooth).

plot of chunk unnamed-chunk-32

Change the legend location

library(grid)
d <- d + theme(legend.direction = "horizontal", legend.position = c(0.75, 0.8), 
    legend.key.width = unit(1, "cm"), legend.title.align = 1)
d

## geom_smooth: method="auto" and size of largest group is >=1000, so using
## gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the
## smoothing method.

## Warning: Removed 18 rows containing missing values (stat_smooth).

plot of chunk unnamed-chunk-33

Change axis labels

d <- d + xlab("Grant Start Year") + ylab("Years Since Grant Start")
d <- d + ggtitle("References to NSF ASF Grants")
d

## geom_smooth: method="auto" and size of largest group is >=1000, so using
## gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the
## smoothing method.

## Warning: Removed 18 rows containing missing values (stat_smooth).

plot of chunk unnamed-chunk-34

Label the future

d <- d + annotate("text", x = 2002, y = 11, label = "The Future", size = 7, 
    angle = -45, fontface = "italic") + coord_fixed()
d

## geom_smooth: method="auto" and size of largest group is >=1000, so using
## gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the
## smoothing method.

## Warning: Removed 18 rows containing missing values (stat_smooth).

plot of chunk unnamed-chunk-35

Improvements

This figure is, I think, just OK. There are a lot of colors, and it can be hard to compare across colors sometimes. We also aren't comparing apples to apples when it comes to 1995's papers and 2010's papers, by virtue of there being fewer years from which to sample.

Probably the easiest way to compare is just to plot the paper production in the first year for all years.

Look just at the first year post-grant

my.subset <- subset(df.filt, YearsSinceGrant <= 1)
d <- ggplot(data = my.subset, aes(x = GrantStartYear)) + geom_histogram(binwidth = 1)

We can also use the stat_bin command with the geom option to plot this as a line instead of as bars.

d <- ggplot(data = my.subset, aes(x = GrantStartYear)) + stat_bin(binwidth = 1, 
    geom = "line")
d

## ymax not defined: adjusting position using y instead

plot of chunk unnamed-chunk-37

We can also use the stat_bin command with the geom option to plot this as a line instead of as bars.

Clearly, the number of papers in the first year has increased over time. You might ask, is this related to the number of grants? Is this related to the total amount of money? Let's compute and plot these.

unique.grants <- unique(my.subset[, c("GrantStartYear", "GrantNumber", "AwardedUSD")])
a.money <- aggregate(unique.grants$AwardedUSD, list(unique.grants$GrantStartYear), 
    sum)
a.grants <- aggregate(unique.grants$GrantStartYear, list(unique.grants$GrantStartYear), 
    length)
a.papers <- aggregate(my.subset$PaperTitle, list(my.subset$GrantStartYear), 
    length)
new.df <- data.frame(year = a.money$Group.1, nGrants = a.grants$x/max(a.grants$x), 
    nDollars = a.money$x/max(a.money$x), nPapers = a.papers$x/max(a.papers$x))

ggplot(data = new.df, aes(x = year)) + geom_line(aes(y = nGrants, color = "Grants")) + 
    geom_line(aes(y = nDollars, color = "Dollars")) + geom_line(aes(y = nPapers, 
    color = "Papers")) + scale_colour_manual("", breaks = c("Grants", "Dollars", 
    "Papers"), values = c("red", "black", "blue")) + ylab("Number (relative to maximum)") + 
    xlab("Year") + ggtitle("Grants, Funding and Publications over Time")

plot of chunk unnamed-chunk-38

Details on this and other styles are available at http://stackoverflow.com/questions/10349206/add-legend-to-ggplot2-line-plot

We can also display this information in a different way, using densities as opposed to a 2-d heat map.

ggplot(data = df.filt) + geom_density(aes(x = YearsSinceGrant, color = factor(GrantStartYear), 
    fill = factor(GrantStartYear)), adjust = 3, alpha = 0.2)

plot of chunk unnamed-chunk-39

This is still lacking. We'd like to split this up somehow to see what's going on more clearly!

ggplot(data = df.filt) + geom_density(aes(x = YearsSinceGrant, color = factor(GrantStartYear), 
    fill = factor(GrantStartYear)), adjust = 3, alpha = 0.2) + facet_grid(Administration ~ 
    .)

plot of chunk unnamed-chunk-40