Introduction to R

This file is online!

http://rpubs.com/alexplanation/dst4l

https://gist.github.com/alexstorer/5460479

R is an interpreted language like Python

"hello world"
## [1] "hello world"

If you use NumPy/SciPy, here is a conversion to R: http://mathesaurus.sourceforge.net/matlab-python-xref.pdf

R is better than Python

The language is designed to make statistics and data handling easy

mtcars["Honda Civic", "mpg"]
## [1] 30.4
lm(mpg ~ cyl + gear + wt, data = mtcars)
## 
## Call:
## lm(formula = mpg ~ cyl + gear + wt, data = mtcars)
## 
## Coefficients:
## (Intercept)          cyl         gear           wt  
##      42.386       -1.528       -0.523       -3.392

Math, Statistics and Matrices work out of the box

2 + 2
## [1] 4
log(2^(0.25))
## [1] 0.1733

Install add ons from inside R

# install.packages('ggplot2')
library(ggplot2)

Superb graphing tools

ggplot(data = mtcars, aes(x = mpg, y = wt, color = cyl)) + geom_point()

plot of chunk unnamed-chunk-5

Incredible IDE: RStudio

RStudio Screenshot

You can even run RStudio on the internet, and set up your own FREE Amazon instance to run it:

http://www.louisaslett.com/RStudio_AMI/

CRAN Task Views (http://cran.r-project.org/web/views/) summarize 3rd party packages that do many types of anaylsis

Python is better than R

Installed by default

Syntax is much easier to learn and use

Powerful and intuitive data structures

Outstanding text processing

RPy2 lets you use R from Python

In practice, I use both!


Today's Goal: Getting Started With R

Today's Strategy

Jump in with both feet, worry about the details later! More time spent doing.

Let's get data into R

df <- read.csv("http://files.figshare.com/983501/nsfastgrantbib20121130.csv")

Data Frame examples

names(df)
##  [1] "GrantPaperID"             "NSFOrg"                  
##  [3] "GrantNumber"              "GrantTitle"              
##  [5] "AwardedUSD"               "ARRAAmount"              
##  [7] "GrantLink"                "PaperTitle"              
##  [9] "PaperJournal"             "PaperBibcode"            
## [11] "PaperLink"                "Refereed"                
## [13] "CitationCount"            "NSFPrograms"             
## [15] "GrantStartMonth"          "GrantStartYear"          
## [17] "PaperPubDate"             "PIFirstName"             
## [19] "PILastName"               "PaperAuthor"             
## [21] "PIMatch"                  "GranteeOrganization"     
## [23] "GranteeOrganizationState" "GranteeOrganizationZip"  
## [25] "NSFDirectorate"           "PaperAuthorAffiliation"  
## [27] "PaperAuthorCountry"       "PaperKeywords"
df$PaperTitle[100]
## [1] Dust, Atomic, and Molecular Gas in the Nearest Primitive Environment
## 20200 Levels:  ...
df[100, "PaperTitle"]
## [1] Dust, Atomic, and Molecular Gas in the Nearest Primitive Environment
## 20200 Levels:  ...
head(df$PaperTitle)
## [1] 0 papers                                                                                                                                             
## [2] 0 papers                                                                                                                                             
## [3] Spitzer Evidence for a Late-heavy Bombardment and the Formation of Ureilites in η Corvi at ~1 Gyr                                                    
## [4] Spitzer Evidence for a Late Heavy Bombardment and the Formation of Urelites in η Corvi at ~1 Gyr                                                    
## [5] Mid-infrared spectra of differentiated meteorites (achondrites): Comparison with astronomical observations of dust in protoplanetary and debris disks
## [6] Spitzer Observations of η Corvi: Evidence at ~1 Gyr for an LHB-Like Delivery of Organics and Water-Rich Material to the THZ of a Sun-Like Star       
## 20200 Levels:  ...

Returning subsets of data frames

df[5:10, c("PaperTitle", "GrantTitle")]
##                                                                                                                                               PaperTitle
## 5  Mid-infrared spectra of differentiated meteorites (achondrites): Comparison with astronomical observations of dust in protoplanetary and debris disks
## 6         Spitzer Observations of η Corvi: Evidence at ~1 Gyr for an LHB-Like Delivery of Organics and Water-Rich Material to the THZ of a Sun-Like Star
## 7                                                                                                                            Ionization of Infalling Gas
## 8                                                                         Sodium atoms in the lunar exotail: Observed velocity and spatial distributions
## 9                                                                                   Measurement of a Magnetic Field in a Leading Arm High-velocity Cloud
## 10                                                                    Vertical Structure of a Supernova-driven Turbulent, Magnetized Interstellar Medium
##                                                                                        GrantTitle
## 5                            Infrared Studies of Circumstellar Dust in Planet-Forming Exo-Systems
## 6                            Infrared Studies of Circumstellar Dust in Planet-Forming Exo-Systems
## 7  Ionized Gas in the Galaxy: A Southern Hemisphere Perspective with the Wisconsin H-Alpha Mapper
## 8  Ionized Gas in the Galaxy: A Southern Hemisphere Perspective with the Wisconsin H-Alpha Mapper
## 9  Ionized Gas in the Galaxy: A Southern Hemisphere Perspective with the Wisconsin H-Alpha Mapper
## 10 Ionized Gas in the Galaxy: A Southern Hemisphere Perspective with the Wisconsin H-Alpha Mapper

The c command, for combine, combines variables into arrays. The : operator returns a vector of values

df[seq(5, 10, 2), c("GrantStartYear", "GrantStartMonth")]
##   GrantStartYear GrantStartMonth
## 5           2009               9
## 7           2006               9
## 9           2006               9

The seq command is a more flexible version of :

A note on this data

How many papers have more than 50 citations?

head(df$CitationCount > 50)
## [1]    NA    NA FALSE FALSE FALSE FALSE
table(df$CitationCount > 50)
## 
## FALSE  TRUE 
## 20884  4827
head(which(df$CitationCount > 50))
## [1]  89  90  98 135 136 140
paste("There are", length(which(df$CitationCount > 50)), "papers")
## [1] "There are 4827 papers"

But these are not unique!

numPapers <- length(unique(df[df$CitationCount > 50, "PaperTitle"]))

How many different grants are in our data set?

# Try this one yourself!

The duplicated function

Like unique but gives us the indices (logical or otherwise).

table(duplicated(df$GrantNumber))
## 
## FALSE  TRUE 
##  3432 26075

What is the mean of the money awarded for papers with at least 50 Citations? Less than 10 citations?

popular.papers <- subset(df, CitationCount > 50)
less.popular.papers <- subset(df, CitationCount < 10)
mpp <- mean(popular.papers$AwardedUSD)
lpp <- mean(less.popular.papers$AwardedUSD)
paste(">50:", mpp, "<10", lpp)
## [1] ">50: 1184260.64926455 <10 1482004.83885319"

Make a table of award amounts by presidential administration

df$AwardAmountGroups <- cut(df$AwardedUSD, 3, labels = c("low", "med", "hi"))
df$Administration <- cut(df$GrantStartYear, breaks = c(1992, 2000, 2008, 2013), 
    labels = c("Clinton", "Bush", "Obama"))

with(df, table(Administration, AwardAmountGroups))
##               AwardAmountGroups
## Administration   low   med    hi
##        Clinton  9318    52     0
##        Bush    16348     3     2
##        Obama    3784     0     0

Exercises

Solutions

df.grants <- df[!duplicated(df$GrantNumber), ]
table(df.grants$GrantStartMonth)
## 
##   1   2   3   4   5   6   7   8   9  10  11  12 
##  76  53 110 115 187 333 723 707 870 193  33  32
plot(table(df.grants$GrantStartMonth))

plot of chunk unnamed-chunk-16

df.papers <- df[!duplicated(df$PaperTitle), ]
table(df.papers$Refereed)
## 
##     0 papers non-refereed     refereed 
##            1         5441        14758
df.grants <- df[!duplicated(df$GrantNumber), ]
sort(table(df.grants$GranteeOrganizationState))
## 
##  AK  ND  ME  NE  PR      MT  RI  AR  OR  WV  DE  NV  VT  UT  KS  AL  IA 
##   1   1   2   3   3   4   4   4   5   5   5   6   6   6   9  10  15  15 
##  LA  OK  WY  NH  KY  SC  MO  TN  GA  NC  IN  MN  NM  WA  CT  CO  WI  FL 
##  16  16  17  21  22  25  26  34  35  42  43  43  58  58  63  86  94  95 
##  MD  MI  NJ  OH  HI  VA  IL  DC  PA  TX  AZ  MA  NY  CA 
##  99 102 102 106 113 115 120 144 151 156 203 262 264 597
max.citations <- max(df$CitationCount)
max.citations <- max(df$CitationCount, na.rm = T)
max.index <- which(df$CitationCount == max.citations)
max.grants <- df[max.index, "GrantNumber"]
# Note: this paper was funded by two grants!
max.subset <- subset(df, GrantNumber == max.grants[1] | GrantNumber == max.grants[2])
length(unique(max.subset$PaperTitle))
## [1] 11


# Here is a more elegant way:
max.grants <- df[which.max(df$CitationCount), "GrantNumber"]
max.subset <- subset(df, GrantNumber %in% max.grants)
length(unique(max.subset$PaperTitle))
## [1] 5
papers <- aggregate(df$GrantNumber, list(df$PaperTitle), length)
length(which(papers$x > 3))
## [1] 589

Our goal!

Plot of Grants

What do we need in our data frame?

x-axis - Year of Paper y-axis - Years Since Grant color - Number of Papers

levels(df$PaperPubDate)
##   [1] ""        "00/1979" "00/1996" "00/1997" "00/1998" "00/1999" "00/2000"
##   [8] "00/2001" "00/2002" "00/2003" "00/2004" "00/2005" "00/2006" "00/2007"
##  [15] "00/2008" "00/2009" "00/2010" "00/2011" "00/2012" "00/2013" "1-Apr"  
##  [22] "1-Aug"   "1-Dec"   "1-Feb"   "1-Jan"   "1-Jul"   "1-Jun"   "1-Mar"  
##  [29] "1-May"   "1-Nov"   "1-Oct"   "1-Sep"   "10-Apr"  "10-Aug"  "10-Dec" 
##  [36] "10-Feb"  "10-Jan"  "10-Jul"  "10-Jun"  "10-Mar"  "10-May"  "10-Nov" 
##  [43] "10-Oct"  "10-Sep"  "11-Apr"  "11-Aug"  "11-Dec"  "11-Feb"  "11-Jan" 
##  [50] "11-Jul"  "11-Jun"  "11-Mar"  "11-May"  "11-Nov"  "11-Oct"  "11-Sep" 
##  [57] "12-Apr"  "12-Aug"  "12-Dec"  "12-Feb"  "12-Jan"  "12-Jul"  "12-Jun" 
##  [64] "12-Mar"  "12-May"  "12-Nov"  "12-Oct"  "12-Sep"  "13-Jan"  "2-Apr"  
##  [71] "2-Aug"   "2-Dec"   "2-Feb"   "2-Jan"   "2-Jul"   "2-Jun"   "2-Mar"  
##  [78] "2-May"   "2-Nov"   "2-Oct"   "2-Sep"   "2012"    "3-Apr"   "3-Aug"  
##  [85] "3-Dec"   "3-Feb"   "3-Jan"   "3-Jul"   "3-Jun"   "3-Mar"   "3-May"  
##  [92] "3-Nov"   "3-Oct"   "3-Sep"   "4-Apr"   "4-Aug"   "4-Dec"   "4-Feb"  
##  [99] "4-Jan"   "4-Jul"   "4-Jun"   "4-Mar"   "4-May"   "4-Nov"   "4-Oct"  
## [106] "4-Sep"   "5-Apr"   "5-Aug"   "5-Dec"   "5-Feb"   "5-Jan"   "5-Jul"  
## [113] "5-Jun"   "5-Mar"   "5-May"   "5-Nov"   "5-Oct"   "5-Sep"   "6-Apr"  
## [120] "6-Aug"   "6-Dec"   "6-Feb"   "6-Jan"   "6-Jul"   "6-Jun"   "6-Mar"  
## [127] "6-May"   "6-Nov"   "6-Oct"   "6-Sep"   "7-Apr"   "7-Aug"   "7-Dec"  
## [134] "7-Feb"   "7-Jan"   "7-Jul"   "7-Jun"   "7-Mar"   "7-May"   "7-Nov"  
## [141] "7-Oct"   "7-Sep"   "8-Apr"   "8-Aug"   "8-Dec"   "8-Feb"   "8-Jan"  
## [148] "8-Jul"   "8-Jun"   "8-Mar"   "8-May"   "8-Nov"   "8-Oct"   "8-Sep"  
## [155] "9-Apr"   "9-Aug"   "9-Dec"   "9-Feb"   "9-Jan"   "9-Jul"   "9-Jun"  
## [162] "9-Mar"   "9-May"   "9-Nov"   "9-Oct"   "9-Sep"   "Apr-00"  "Apr-79" 
## [169] "Apr-91"  "Apr-94"  "Apr-95"  "Apr-96"  "Apr-97"  "Apr-98"  "Apr-99" 
## [176] "Aug-00"  "Aug-79"  "Aug-81"  "Aug-86"  "Aug-88"  "Aug-91"  "Aug-95" 
## [183] "Aug-96"  "Aug-97"  "Aug-98"  "Aug-99"  "Dec-00"  "Dec-82"  "Dec-87" 
## [190] "Dec-90"  "Dec-95"  "Dec-96"  "Dec-97"  "Dec-98"  "Dec-99"  "Feb-00" 
## [197] "Feb-79"  "Feb-82"  "Feb-83"  "Feb-89"  "Feb-96"  "Feb-97"  "Feb-98" 
## [204] "Feb-99"  "Jan-00"  "Jan-78"  "Jan-81"  "Jan-83"  "Jan-93"  "Jan-96" 
## [211] "Jan-97"  "Jan-98"  "Jan-99"  "Jul-00"  "Jul-80"  "Jul-83"  "Jul-87" 
## [218] "Jul-88"  "Jul-94"  "Jul-95"  "Jul-96"  "Jul-97"  "Jul-98"  "Jul-99" 
## [225] "Jun-00"  "Jun-01"  "Jun-89"  "Jun-95"  "Jun-96"  "Jun-97"  "Jun-98" 
## [232] "Jun-99"  "Mar-00"  "Mar-77"  "Mar-89"  "Mar-96"  "Mar-97"  "Mar-98" 
## [239] "Mar-99"  "May-00"  "May-81"  "May-85"  "May-93"  "May-96"  "May-97" 
## [246] "May-98"  "May-99"  "Nov-00"  "Nov-95"  "Nov-96"  "Nov-97"  "Nov-98" 
## [253] "Nov-99"  "Oct-00"  "Oct-95"  "Oct-96"  "Oct-97"  "Oct-98"  "Oct-99" 
## [260] "Sep-00"  "Sep-86"  "Sep-95"  "Sep-96"  "Sep-97"  "Sep-98"  "Sep-99"

We have to use regular expressions here

yearstr <- paste(df$PaperPubDate)
unique(yearstr)
##   [1] ""        "12-Mar"  "12-Feb"  "12-May"  "11-Mar"  "9-Jan"   "12-Jun" 
##   [8] "10-Dec"  "12-Jul"  "8-Oct"   "8-Nov"   "9-Jul"   "9-Oct"   "12-Nov" 
##  [15] "10-Oct"  "12-Sep"  "11-Jan"  "10-Jan"  "12-Jan"  "11-Nov"  "10-Aug" 
##  [22] "10-Feb"  "12-Apr"  "10-Jun"  "11-Oct"  "11-Jul"  "10-Sep"  "11-Dec" 
##  [29] "11-Feb"  "11-Sep"  "12-Oct"  "00/2012" "11-Jun"  "11-May"  "9-Mar"  
##  [36] "4-Apr"   "8-Jun"   "3-Nov"   "4-Aug"   "6-Feb"   "8-Mar"   "5-Jul"  
##  [43] "4-Dec"   "2-Dec"   "5-May"   "3-Apr"   "5-Sep"   "6-Dec"   "7-Jul"  
##  [50] "5-Jun"   "5-Oct"   "4-Mar"   "10-May"  "9-Nov"   "9-Dec"   "9-Feb"  
##  [57] "7-Jun"   "9-Jun"   "9-Apr"   "7-Mar"   "7-Oct"   "7-Feb"   "6-Jun"  
##  [64] "5-Mar"   "7-Jan"   "6-Oct"   "8-Sep"   "8-Feb"   "7-Dec"   "5-Dec"  
##  [71] "5-Aug"   "9-Sep"   "3-Mar"   "7-Apr"   "4-Nov"   "3-May"   "2-Sep"  
##  [78] "11-Aug"  "12-Aug"  "10-Nov"  "11-Apr"  "10-Jul"  "8-Jan"   "10-Apr" 
##  [85] "7-Sep"   "5-Jan"   "7-Nov"   "6-Aug"   "9-May"   "4-Jun"   "00/2006"
##  [92] "5-Nov"   "9-Aug"   "8-Apr"   "7-Aug"   "8-Jul"   "10-Mar"  "7-May"  
##  [99] "4-May"   "6-Nov"   "6-Jul"   "3-Dec"   "4-Jul"   "6-May"   "5-Feb"  
## [106] "4-Jan"   "12-Dec"  "8-May"   "6-Sep"   "6-Mar"   "6-Apr"   "Jul-00" 
## [113] "1-Jun"   "1-Oct"   "8-Dec"   "00/2009" "6-Jan"   "00/2007" "4-Oct"  
## [120] "3-Oct"   "5-Apr"   "3-Aug"   "3-Jun"   "4-Sep"   "00/2003" "00/2011"
## [127] "1-Jul"   "00/2002" "2-Mar"   "3-Feb"   "2-May"   "2-Apr"   "2-Jun"  
## [134] "1-Nov"   "2-Oct"   "1-Dec"   "8-Aug"   "00/2008" "Jan-93"  "3-Sep"  
## [141] "2012"    "2-Nov"   "4-Feb"   "Dec-99"  "Feb-89"  "Aug-81"  "00/2004"
## [148] "Jul-83"  "00/2005" "3-Jan"   "3-Jul"   "00/2010" "2-Feb"   "2-Jul"  
## [155] "2-Aug"   "May-85"  "Jul-87"  "2-Jan"   "Jul-97"  "Aug-79"  "00/1979"
## [162] "Apr-79"  "Jul-80"  "May-81"  "Mar-89"  "Feb-82"  "Dec-87"  "Dec-90" 
## [169] "Aug-88"  "Oct-00"  "Sep-97"  "Aug-96"  "1-Apr"   "00/2001" "1-Sep"  
## [176] "Dec-00"  "1-May"   "Sep-00"  "Jan-00"  "Feb-00"  "Sep-99"  "Nov-99" 
## [183] "Mar-77"  "Aug-91"  "13-Jan"  "Nov-00"  "00/2013" "1-Mar"   "1-Jan"  
## [190] "1-Feb"   "Jan-99"  "Jan-78"  "Feb-83"  "Jan-81"  "Jan-83"  "1-Aug"  
## [197] "Apr-91"  "Mar-96"  "Aug-95"  "Feb-79"  "Mar-00"  "Sep-86"  "Jul-96" 
## [204] "Jul-88"  "Feb-99"  "May-93"  "Apr-94"  "Aug-86"  "00/2000" "Apr-99" 
## [211] "Aug-00"  "Dec-82"  "Jul-94"  "May-00"  "Jun-89"  "Jun-00"  "May-99" 
## [218] "Aug-99"  "Apr-97"  "00/1998" "00/1999" "Aug-97"  "May-97"  "Nov-97" 
## [225] "Jul-99"  "Mar-97"  "Mar-98"  "Apr-00"  "Jun-99"  "Aug-98"  "May-98" 
## [232] "Dec-97"  "Nov-98"  "Jun-98"  "Oct-99"  "Dec-96"  "00/1996" "Jul-98" 
## [239] "Apr-98"  "Jan-98"  "Oct-98"  "Mar-99"  "00/1997" "Feb-98"  "Sep-98" 
## [246] "Dec-98"  "Oct-97"  "Jan-97"  "Oct-96"  "Jun-97"  "Jun-01"  "Nov-96" 
## [253] "Sep-96"  "Jun-96"  "Feb-97"  "Dec-95"  "Nov-95"  "May-96"  "Oct-95" 
## [260] "Apr-96"  "Jan-96"  "Sep-95"  "Jun-95"  "Feb-96"  "Jul-95"  "Apr-95"
yearstr <- gsub(".*/", "", yearstr, perl = TRUE)
unique(yearstr)
##   [1] ""       "12-Mar" "12-Feb" "12-May" "11-Mar" "9-Jan"  "12-Jun"
##   [8] "10-Dec" "12-Jul" "8-Oct"  "8-Nov"  "9-Jul"  "9-Oct"  "12-Nov"
##  [15] "10-Oct" "12-Sep" "11-Jan" "10-Jan" "12-Jan" "11-Nov" "10-Aug"
##  [22] "10-Feb" "12-Apr" "10-Jun" "11-Oct" "11-Jul" "10-Sep" "11-Dec"
##  [29] "11-Feb" "11-Sep" "12-Oct" "2012"   "11-Jun" "11-May" "9-Mar" 
##  [36] "4-Apr"  "8-Jun"  "3-Nov"  "4-Aug"  "6-Feb"  "8-Mar"  "5-Jul" 
##  [43] "4-Dec"  "2-Dec"  "5-May"  "3-Apr"  "5-Sep"  "6-Dec"  "7-Jul" 
##  [50] "5-Jun"  "5-Oct"  "4-Mar"  "10-May" "9-Nov"  "9-Dec"  "9-Feb" 
##  [57] "7-Jun"  "9-Jun"  "9-Apr"  "7-Mar"  "7-Oct"  "7-Feb"  "6-Jun" 
##  [64] "5-Mar"  "7-Jan"  "6-Oct"  "8-Sep"  "8-Feb"  "7-Dec"  "5-Dec" 
##  [71] "5-Aug"  "9-Sep"  "3-Mar"  "7-Apr"  "4-Nov"  "3-May"  "2-Sep" 
##  [78] "11-Aug" "12-Aug" "10-Nov" "11-Apr" "10-Jul" "8-Jan"  "10-Apr"
##  [85] "7-Sep"  "5-Jan"  "7-Nov"  "6-Aug"  "9-May"  "4-Jun"  "2006"  
##  [92] "5-Nov"  "9-Aug"  "8-Apr"  "7-Aug"  "8-Jul"  "10-Mar" "7-May" 
##  [99] "4-May"  "6-Nov"  "6-Jul"  "3-Dec"  "4-Jul"  "6-May"  "5-Feb" 
## [106] "4-Jan"  "12-Dec" "8-May"  "6-Sep"  "6-Mar"  "6-Apr"  "Jul-00"
## [113] "1-Jun"  "1-Oct"  "8-Dec"  "2009"   "6-Jan"  "2007"   "4-Oct" 
## [120] "3-Oct"  "5-Apr"  "3-Aug"  "3-Jun"  "4-Sep"  "2003"   "2011"  
## [127] "1-Jul"  "2002"   "2-Mar"  "3-Feb"  "2-May"  "2-Apr"  "2-Jun" 
## [134] "1-Nov"  "2-Oct"  "1-Dec"  "8-Aug"  "2008"   "Jan-93" "3-Sep" 
## [141] "2-Nov"  "4-Feb"  "Dec-99" "Feb-89" "Aug-81" "2004"   "Jul-83"
## [148] "2005"   "3-Jan"  "3-Jul"  "2010"   "2-Feb"  "2-Jul"  "2-Aug" 
## [155] "May-85" "Jul-87" "2-Jan"  "Jul-97" "Aug-79" "1979"   "Apr-79"
## [162] "Jul-80" "May-81" "Mar-89" "Feb-82" "Dec-87" "Dec-90" "Aug-88"
## [169] "Oct-00" "Sep-97" "Aug-96" "1-Apr"  "2001"   "1-Sep"  "Dec-00"
## [176] "1-May"  "Sep-00" "Jan-00" "Feb-00" "Sep-99" "Nov-99" "Mar-77"
## [183] "Aug-91" "13-Jan" "Nov-00" "2013"   "1-Mar"  "1-Jan"  "1-Feb" 
## [190] "Jan-99" "Jan-78" "Feb-83" "Jan-81" "Jan-83" "1-Aug"  "Apr-91"
## [197] "Mar-96" "Aug-95" "Feb-79" "Mar-00" "Sep-86" "Jul-96" "Jul-88"
## [204] "Feb-99" "May-93" "Apr-94" "Aug-86" "2000"   "Apr-99" "Aug-00"
## [211] "Dec-82" "Jul-94" "May-00" "Jun-89" "Jun-00" "May-99" "Aug-99"
## [218] "Apr-97" "1998"   "1999"   "Aug-97" "May-97" "Nov-97" "Jul-99"
## [225] "Mar-97" "Mar-98" "Apr-00" "Jun-99" "Aug-98" "May-98" "Dec-97"
## [232] "Nov-98" "Jun-98" "Oct-99" "Dec-96" "1996"   "Jul-98" "Apr-98"
## [239] "Jan-98" "Oct-98" "Mar-99" "1997"   "Feb-98" "Sep-98" "Dec-98"
## [246] "Oct-97" "Jan-97" "Oct-96" "Jun-97" "Jun-01" "Nov-96" "Sep-96"
## [253] "Jun-96" "Feb-97" "Dec-95" "Nov-95" "May-96" "Oct-95" "Apr-96"
## [260] "Jan-96" "Sep-95" "Jun-95" "Feb-96" "Jul-95" "Apr-95"
yearstr <- gsub("\\D*", "", yearstr, perl = TRUE)
unique(yearstr)
##  [1] ""     "12"   "11"   "9"    "10"   "8"    "2012" "4"    "3"    "6"   
## [11] "5"    "2"    "7"    "2006" "00"   "1"    "2009" "2007" "2003" "2011"
## [21] "2002" "2008" "93"   "99"   "89"   "81"   "2004" "83"   "2005" "2010"
## [31] "85"   "87"   "97"   "79"   "1979" "80"   "82"   "90"   "88"   "96"  
## [41] "2001" "77"   "91"   "13"   "2013" "78"   "95"   "86"   "94"   "2000"
## [51] "1998" "1999" "98"   "1996" "1997" "01"
changeinds <- which(as.numeric(yearstr) < 14)
changevals <- as.numeric(yearstr[changeinds]) + 2000
yearstr[changeinds] <- as.character(changevals)
unique(yearstr)
##  [1] ""     "2012" "2011" "2009" "2010" "2008" "2004" "2003" "2006" "2005"
## [11] "2002" "2007" "2000" "2001" "93"   "99"   "89"   "81"   "83"   "85"  
## [21] "87"   "97"   "79"   "1979" "80"   "82"   "90"   "88"   "96"   "77"  
## [31] "91"   "2013" "78"   "95"   "86"   "94"   "1998" "1999" "98"   "1996"
## [41] "1997"
changeinds <- which(unlist(lapply(yearstr, nchar)) == 2)
changevals <- paste("19", yearstr[changeinds], sep = "")
yearstr[changeinds] <- changevals
df$PaperPubDateNumeric <- as.numeric(yearstr)

Now make a new variable for years since grant

df$YearsSinceGrant <- df$PaperPubDateNumeric - df$GrantStartYear

Remove bad rows

Rows with no papers, papers published prior to grant years (?!) and unrefereed papers.

df.filt <- subset(df, YearsSinceGrant >= 0)
df.filt <- subset(df.filt, Refereed == "refereed")

Basic Plots

ggplot(data = df.filt, aes(x = YearsSinceGrant)) + geom_bar()
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust
## this.

plot of chunk unnamed-chunk-25

ggplot maps various aesthetics to components of the data. It then draws things based on these aesthetics, and any transformations.

ggplot(data = df.filt, aes(x = YearsSinceGrant, y = CitationCount)) + geom_point()
## Warning: Removed 60 rows containing missing values (geom_point).

plot of chunk unnamed-chunk-26

library(scales)
ggplot(data = df.filt, aes(x = YearsSinceGrant, y = CitationCount)) + geom_point() + 
    scale_y_log10()
## Warning: Removed 60 rows containing missing values (geom_point).

plot of chunk unnamed-chunk-27

Final Plot

d <- ggplot(df.filt, aes(x = GrantStartYear, y = YearsSinceGrant)) + xlim(1995, 
    2011) + ylim(0, 16)
d <- d + geom_bin2d(binwidth = c(1, 1))
d

plot of chunk unnamed-chunk-28

Change the color

d <- d + scale_fill_gradient(low = "paleturquoise1", high = "midnightblue", 
    name = "# Papers")
d

plot of chunk unnamed-chunk-29

Change the theme

d <- d + theme_bw()
d <- d + theme(text = element_text(size = 24))
d <- d + theme(panel.grid = element_blank())
d

plot of chunk unnamed-chunk-30

Add a line

d <- d + geom_abline(intercept = 2012, slope = -1, size = 2, linetype = "dashed")
d

plot of chunk unnamed-chunk-31

Add a smoother

d <- d + geom_smooth(color = "red")
d
## geom_smooth: method="auto" and size of largest group is >=1000, so using
## gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the
## smoothing method.
## Warning: Removed 18 rows containing missing values (stat_smooth).

plot of chunk unnamed-chunk-32

Change the legend location

library(grid)
d <- d + theme(legend.direction = "horizontal", legend.position = c(0.75, 0.8), 
    legend.key.width = unit(1, "cm"), legend.title.align = 1)
d
## geom_smooth: method="auto" and size of largest group is >=1000, so using
## gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the
## smoothing method.
## Warning: Removed 18 rows containing missing values (stat_smooth).

plot of chunk unnamed-chunk-33

Change axis labels

d <- d + xlab("Grant Start Year") + ylab("Years Since Grant Start")
d <- d + ggtitle("References to NSF ASF Grants")
d
## geom_smooth: method="auto" and size of largest group is >=1000, so using
## gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the
## smoothing method.
## Warning: Removed 18 rows containing missing values (stat_smooth).

plot of chunk unnamed-chunk-34

Label the future

d <- d + annotate("text", x = 2002, y = 11, label = "The Future", size = 7, 
    angle = -45, fontface = "italic") + coord_fixed()
d
## geom_smooth: method="auto" and size of largest group is >=1000, so using
## gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the
## smoothing method.
## Warning: Removed 18 rows containing missing values (stat_smooth).

plot of chunk unnamed-chunk-35

Improvements

This figure is, I think, just OK. There are a lot of colors, and it can be hard to compare across colors sometimes. We also aren't comparing apples to apples when it comes to 1995's papers and 2010's papers, by virtue of there being fewer years from which to sample.

Probably the easiest way to compare is just to plot the paper production in the first year for all years.

Look just at the first year post-grant

my.subset <- subset(df.filt, YearsSinceGrant <= 1)
d <- ggplot(data = my.subset, aes(x = GrantStartYear)) + geom_histogram(binwidth = 1)

We can also use the stat_bin command with the geom option to plot this as a line instead of as bars.

d <- ggplot(data = my.subset, aes(x = GrantStartYear)) + stat_bin(binwidth = 1, 
    geom = "line")
d
## ymax not defined: adjusting position using y instead

plot of chunk unnamed-chunk-37

We can also use the stat_bin command with the geom option to plot this as a line instead of as bars.

Clearly, the number of papers in the first year has increased over time. You might ask, is this related to the number of grants? Is this related to the total amount of money? Let's compute and plot these.

unique.grants <- unique(my.subset[, c("GrantStartYear", "GrantNumber", "AwardedUSD")])
a.money <- aggregate(unique.grants$AwardedUSD, list(unique.grants$GrantStartYear), 
    sum)
a.grants <- aggregate(unique.grants$GrantStartYear, list(unique.grants$GrantStartYear), 
    length)
a.papers <- aggregate(my.subset$PaperTitle, list(my.subset$GrantStartYear), 
    length)
new.df <- data.frame(year = a.money$Group.1, nGrants = a.grants$x/max(a.grants$x), 
    nDollars = a.money$x/max(a.money$x), nPapers = a.papers$x/max(a.papers$x))

ggplot(data = new.df, aes(x = year)) + geom_line(aes(y = nGrants, color = "Grants")) + 
    geom_line(aes(y = nDollars, color = "Dollars")) + geom_line(aes(y = nPapers, 
    color = "Papers")) + scale_colour_manual("", breaks = c("Grants", "Dollars", 
    "Papers"), values = c("red", "black", "blue")) + ylab("Number (relative to maximum)") + 
    xlab("Year") + ggtitle("Grants, Funding and Publications over Time")

plot of chunk unnamed-chunk-38

Details on this and other styles are available at http://stackoverflow.com/questions/10349206/add-legend-to-ggplot2-line-plot

We can also display this information in a different way, using densities as opposed to a 2-d heat map.

ggplot(data = df.filt) + geom_density(aes(x = YearsSinceGrant, color = factor(GrantStartYear), 
    fill = factor(GrantStartYear)), adjust = 3, alpha = 0.2)

plot of chunk unnamed-chunk-39

This is still lacking. We'd like to split this up somehow to see what's going on more clearly!

ggplot(data = df.filt) + geom_density(aes(x = YearsSinceGrant, color = factor(GrantStartYear), 
    fill = factor(GrantStartYear)), adjust = 3, alpha = 0.2) + facet_grid(Administration ~ 
    .)

plot of chunk unnamed-chunk-40