This file is online!
http://rpubs.com/alexplanation/dst4l
https://gist.github.com/alexstorer/5460479
"hello world"
## [1] "hello world"
If you use NumPy/SciPy, here is a conversion to R: http://mathesaurus.sourceforge.net/matlab-python-xref.pdf
mtcars["Honda Civic", "mpg"]
## [1] 30.4
lm(mpg ~ cyl + gear + wt, data = mtcars)
##
## Call:
## lm(formula = mpg ~ cyl + gear + wt, data = mtcars)
##
## Coefficients:
## (Intercept) cyl gear wt
## 42.386 -1.528 -0.523 -3.392
2 + 2
## [1] 4
log(2^(0.25))
## [1] 0.1733
# install.packages('ggplot2')
library(ggplot2)
ggplot(data = mtcars, aes(x = mpg, y = wt, color = cyl)) + geom_point()
![]()
You can even run RStudio on the internet, and set up your own FREE Amazon instance to run it:
http://www.louisaslett.com/RStudio_AMI/
Jump in with both feet, worry about the details later! More time spent doing.
df <- read.csv("http://files.figshare.com/983501/nsfastgrantbib20121130.csv")
<- symbol does assignment$[])NAnames(df)
## [1] "GrantPaperID" "NSFOrg"
## [3] "GrantNumber" "GrantTitle"
## [5] "AwardedUSD" "ARRAAmount"
## [7] "GrantLink" "PaperTitle"
## [9] "PaperJournal" "PaperBibcode"
## [11] "PaperLink" "Refereed"
## [13] "CitationCount" "NSFPrograms"
## [15] "GrantStartMonth" "GrantStartYear"
## [17] "PaperPubDate" "PIFirstName"
## [19] "PILastName" "PaperAuthor"
## [21] "PIMatch" "GranteeOrganization"
## [23] "GranteeOrganizationState" "GranteeOrganizationZip"
## [25] "NSFDirectorate" "PaperAuthorAffiliation"
## [27] "PaperAuthorCountry" "PaperKeywords"
df$PaperTitle[100]
## [1] Dust, Atomic, and Molecular Gas in the Nearest Primitive Environment
## 20200 Levels: ...
df[100, "PaperTitle"]
## [1] Dust, Atomic, and Molecular Gas in the Nearest Primitive Environment
## 20200 Levels: ...
head(df$PaperTitle)
## [1] 0 papers
## [2] 0 papers
## [3] Spitzer Evidence for a Late-heavy Bombardment and the Formation of Ureilites in η Corvi at ~1 Gyr
## [4] Spitzer Evidence for a Late Heavy Bombardment and the Formation of Urelites in η Corvi at ~1 Gyr
## [5] Mid-infrared spectra of differentiated meteorites (achondrites): Comparison with astronomical observations of dust in protoplanetary and debris disks
## [6] Spitzer Observations of η Corvi: Evidence at ~1 Gyr for an LHB-Like Delivery of Organics and Water-Rich Material to the THZ of a Sun-Like Star
## 20200 Levels: ...
df[5:10, c("PaperTitle", "GrantTitle")]
## PaperTitle
## 5 Mid-infrared spectra of differentiated meteorites (achondrites): Comparison with astronomical observations of dust in protoplanetary and debris disks
## 6 Spitzer Observations of η Corvi: Evidence at ~1 Gyr for an LHB-Like Delivery of Organics and Water-Rich Material to the THZ of a Sun-Like Star
## 7 Ionization of Infalling Gas
## 8 Sodium atoms in the lunar exotail: Observed velocity and spatial distributions
## 9 Measurement of a Magnetic Field in a Leading Arm High-velocity Cloud
## 10 Vertical Structure of a Supernova-driven Turbulent, Magnetized Interstellar Medium
## GrantTitle
## 5 Infrared Studies of Circumstellar Dust in Planet-Forming Exo-Systems
## 6 Infrared Studies of Circumstellar Dust in Planet-Forming Exo-Systems
## 7 Ionized Gas in the Galaxy: A Southern Hemisphere Perspective with the Wisconsin H-Alpha Mapper
## 8 Ionized Gas in the Galaxy: A Southern Hemisphere Perspective with the Wisconsin H-Alpha Mapper
## 9 Ionized Gas in the Galaxy: A Southern Hemisphere Perspective with the Wisconsin H-Alpha Mapper
## 10 Ionized Gas in the Galaxy: A Southern Hemisphere Perspective with the Wisconsin H-Alpha Mapper
The c command, for combine, combines variables into arrays.
The : operator returns a vector of values
df[seq(5, 10, 2), c("GrantStartYear", "GrantStartMonth")]
## GrantStartYear GrantStartMonth
## 5 2009 9
## 7 2006 9
## 9 2006 9
The seq command is a more flexible version of :
head(df$CitationCount > 50)
## [1] NA NA FALSE FALSE FALSE FALSE
table(df$CitationCount > 50)
##
## FALSE TRUE
## 20884 4827
head(which(df$CitationCount > 50))
## [1] 89 90 98 135 136 140
paste("There are", length(which(df$CitationCount > 50)), "papers")
## [1] "There are 4827 papers"
head returns the first few elements of what you give ittable summarizes a list by its unique valueswhich gives the TRUE indices in a logical arraylength gives the length of any arraypaste concatenates stringsnumPapers <- length(unique(df[df$CitationCount > 50, "PaperTitle"]))
[df$CitationCount>50,"PaperTitle"] - Just the titlesunique removes any repeatslength tells us how many there are# Try this one yourself!
Like unique but gives us the indices (logical or otherwise).
table(duplicated(df$GrantNumber))
##
## FALSE TRUE
## 3432 26075
popular.papers <- subset(df, CitationCount > 50)
less.popular.papers <- subset(df, CitationCount < 10)
mpp <- mean(popular.papers$AwardedUSD)
lpp <- mean(less.popular.papers$AwardedUSD)
paste(">50:", mpp, "<10", lpp)
## [1] ">50: 1184260.64926455 <10 1482004.83885319"
subset returns a data frame with only the matching elementsmean takes the mean of your datadf$AwardAmountGroups <- cut(df$AwardedUSD, 3, labels = c("low", "med", "hi"))
df$Administration <- cut(df$GrantStartYear, breaks = c(1992, 2000, 2008, 2013),
labels = c("Clinton", "Bush", "Obama"))
with(df, table(Administration, AwardAmountGroups))
## AwardAmountGroups
## Administration low med hi
## Clinton 9318 52 0
## Bush 16348 3 2
## Obama 3784 0 0
$ and <- can be used to assign a new column to a data framecut splits your data into groupings by value, either into N groups, or as directedDo grants start more frequently at a given time of year?
What proportion of papers are referreed?
What state received the most grants?
Find the most cited paper. How many other papers were published on that grant?
How many papers refer to at least three grants?
df.grants <- df[!duplicated(df$GrantNumber), ]
table(df.grants$GrantStartMonth)
##
## 1 2 3 4 5 6 7 8 9 10 11 12
## 76 53 110 115 187 333 723 707 870 193 33 32
plot(table(df.grants$GrantStartMonth))
df.papers <- df[!duplicated(df$PaperTitle), ]
table(df.papers$Refereed)
##
## 0 papers non-refereed refereed
## 1 5441 14758
df.grants <- df[!duplicated(df$GrantNumber), ]
sort(table(df.grants$GranteeOrganizationState))
##
## AK ND ME NE PR MT RI AR OR WV DE NV VT UT KS AL IA
## 1 1 2 3 3 4 4 4 5 5 5 6 6 6 9 10 15 15
## LA OK WY NH KY SC MO TN GA NC IN MN NM WA CT CO WI FL
## 16 16 17 21 22 25 26 34 35 42 43 43 58 58 63 86 94 95
## MD MI NJ OH HI VA IL DC PA TX AZ MA NY CA
## 99 102 102 106 113 115 120 144 151 156 203 262 264 597
max.citations <- max(df$CitationCount)
max.citations <- max(df$CitationCount, na.rm = T)
max.index <- which(df$CitationCount == max.citations)
max.grants <- df[max.index, "GrantNumber"]
# Note: this paper was funded by two grants!
max.subset <- subset(df, GrantNumber == max.grants[1] | GrantNumber == max.grants[2])
length(unique(max.subset$PaperTitle))
## [1] 11
# Here is a more elegant way:
max.grants <- df[which.max(df$CitationCount), "GrantNumber"]
max.subset <- subset(df, GrantNumber %in% max.grants)
length(unique(max.subset$PaperTitle))
## [1] 5
papers <- aggregate(df$GrantNumber, list(df$PaperTitle), length)
length(which(papers$x > 3))
## [1] 589

x-axis - Year of Paper y-axis - Years Since Grant color - Number of Papers
levels(df$PaperPubDate)
## [1] "" "00/1979" "00/1996" "00/1997" "00/1998" "00/1999" "00/2000"
## [8] "00/2001" "00/2002" "00/2003" "00/2004" "00/2005" "00/2006" "00/2007"
## [15] "00/2008" "00/2009" "00/2010" "00/2011" "00/2012" "00/2013" "1-Apr"
## [22] "1-Aug" "1-Dec" "1-Feb" "1-Jan" "1-Jul" "1-Jun" "1-Mar"
## [29] "1-May" "1-Nov" "1-Oct" "1-Sep" "10-Apr" "10-Aug" "10-Dec"
## [36] "10-Feb" "10-Jan" "10-Jul" "10-Jun" "10-Mar" "10-May" "10-Nov"
## [43] "10-Oct" "10-Sep" "11-Apr" "11-Aug" "11-Dec" "11-Feb" "11-Jan"
## [50] "11-Jul" "11-Jun" "11-Mar" "11-May" "11-Nov" "11-Oct" "11-Sep"
## [57] "12-Apr" "12-Aug" "12-Dec" "12-Feb" "12-Jan" "12-Jul" "12-Jun"
## [64] "12-Mar" "12-May" "12-Nov" "12-Oct" "12-Sep" "13-Jan" "2-Apr"
## [71] "2-Aug" "2-Dec" "2-Feb" "2-Jan" "2-Jul" "2-Jun" "2-Mar"
## [78] "2-May" "2-Nov" "2-Oct" "2-Sep" "2012" "3-Apr" "3-Aug"
## [85] "3-Dec" "3-Feb" "3-Jan" "3-Jul" "3-Jun" "3-Mar" "3-May"
## [92] "3-Nov" "3-Oct" "3-Sep" "4-Apr" "4-Aug" "4-Dec" "4-Feb"
## [99] "4-Jan" "4-Jul" "4-Jun" "4-Mar" "4-May" "4-Nov" "4-Oct"
## [106] "4-Sep" "5-Apr" "5-Aug" "5-Dec" "5-Feb" "5-Jan" "5-Jul"
## [113] "5-Jun" "5-Mar" "5-May" "5-Nov" "5-Oct" "5-Sep" "6-Apr"
## [120] "6-Aug" "6-Dec" "6-Feb" "6-Jan" "6-Jul" "6-Jun" "6-Mar"
## [127] "6-May" "6-Nov" "6-Oct" "6-Sep" "7-Apr" "7-Aug" "7-Dec"
## [134] "7-Feb" "7-Jan" "7-Jul" "7-Jun" "7-Mar" "7-May" "7-Nov"
## [141] "7-Oct" "7-Sep" "8-Apr" "8-Aug" "8-Dec" "8-Feb" "8-Jan"
## [148] "8-Jul" "8-Jun" "8-Mar" "8-May" "8-Nov" "8-Oct" "8-Sep"
## [155] "9-Apr" "9-Aug" "9-Dec" "9-Feb" "9-Jan" "9-Jul" "9-Jun"
## [162] "9-Mar" "9-May" "9-Nov" "9-Oct" "9-Sep" "Apr-00" "Apr-79"
## [169] "Apr-91" "Apr-94" "Apr-95" "Apr-96" "Apr-97" "Apr-98" "Apr-99"
## [176] "Aug-00" "Aug-79" "Aug-81" "Aug-86" "Aug-88" "Aug-91" "Aug-95"
## [183] "Aug-96" "Aug-97" "Aug-98" "Aug-99" "Dec-00" "Dec-82" "Dec-87"
## [190] "Dec-90" "Dec-95" "Dec-96" "Dec-97" "Dec-98" "Dec-99" "Feb-00"
## [197] "Feb-79" "Feb-82" "Feb-83" "Feb-89" "Feb-96" "Feb-97" "Feb-98"
## [204] "Feb-99" "Jan-00" "Jan-78" "Jan-81" "Jan-83" "Jan-93" "Jan-96"
## [211] "Jan-97" "Jan-98" "Jan-99" "Jul-00" "Jul-80" "Jul-83" "Jul-87"
## [218] "Jul-88" "Jul-94" "Jul-95" "Jul-96" "Jul-97" "Jul-98" "Jul-99"
## [225] "Jun-00" "Jun-01" "Jun-89" "Jun-95" "Jun-96" "Jun-97" "Jun-98"
## [232] "Jun-99" "Mar-00" "Mar-77" "Mar-89" "Mar-96" "Mar-97" "Mar-98"
## [239] "Mar-99" "May-00" "May-81" "May-85" "May-93" "May-96" "May-97"
## [246] "May-98" "May-99" "Nov-00" "Nov-95" "Nov-96" "Nov-97" "Nov-98"
## [253] "Nov-99" "Oct-00" "Oct-95" "Oct-96" "Oct-97" "Oct-98" "Oct-99"
## [260] "Sep-00" "Sep-86" "Sep-95" "Sep-96" "Sep-97" "Sep-98" "Sep-99"
yearstr <- paste(df$PaperPubDate)
unique(yearstr)
## [1] "" "12-Mar" "12-Feb" "12-May" "11-Mar" "9-Jan" "12-Jun"
## [8] "10-Dec" "12-Jul" "8-Oct" "8-Nov" "9-Jul" "9-Oct" "12-Nov"
## [15] "10-Oct" "12-Sep" "11-Jan" "10-Jan" "12-Jan" "11-Nov" "10-Aug"
## [22] "10-Feb" "12-Apr" "10-Jun" "11-Oct" "11-Jul" "10-Sep" "11-Dec"
## [29] "11-Feb" "11-Sep" "12-Oct" "00/2012" "11-Jun" "11-May" "9-Mar"
## [36] "4-Apr" "8-Jun" "3-Nov" "4-Aug" "6-Feb" "8-Mar" "5-Jul"
## [43] "4-Dec" "2-Dec" "5-May" "3-Apr" "5-Sep" "6-Dec" "7-Jul"
## [50] "5-Jun" "5-Oct" "4-Mar" "10-May" "9-Nov" "9-Dec" "9-Feb"
## [57] "7-Jun" "9-Jun" "9-Apr" "7-Mar" "7-Oct" "7-Feb" "6-Jun"
## [64] "5-Mar" "7-Jan" "6-Oct" "8-Sep" "8-Feb" "7-Dec" "5-Dec"
## [71] "5-Aug" "9-Sep" "3-Mar" "7-Apr" "4-Nov" "3-May" "2-Sep"
## [78] "11-Aug" "12-Aug" "10-Nov" "11-Apr" "10-Jul" "8-Jan" "10-Apr"
## [85] "7-Sep" "5-Jan" "7-Nov" "6-Aug" "9-May" "4-Jun" "00/2006"
## [92] "5-Nov" "9-Aug" "8-Apr" "7-Aug" "8-Jul" "10-Mar" "7-May"
## [99] "4-May" "6-Nov" "6-Jul" "3-Dec" "4-Jul" "6-May" "5-Feb"
## [106] "4-Jan" "12-Dec" "8-May" "6-Sep" "6-Mar" "6-Apr" "Jul-00"
## [113] "1-Jun" "1-Oct" "8-Dec" "00/2009" "6-Jan" "00/2007" "4-Oct"
## [120] "3-Oct" "5-Apr" "3-Aug" "3-Jun" "4-Sep" "00/2003" "00/2011"
## [127] "1-Jul" "00/2002" "2-Mar" "3-Feb" "2-May" "2-Apr" "2-Jun"
## [134] "1-Nov" "2-Oct" "1-Dec" "8-Aug" "00/2008" "Jan-93" "3-Sep"
## [141] "2012" "2-Nov" "4-Feb" "Dec-99" "Feb-89" "Aug-81" "00/2004"
## [148] "Jul-83" "00/2005" "3-Jan" "3-Jul" "00/2010" "2-Feb" "2-Jul"
## [155] "2-Aug" "May-85" "Jul-87" "2-Jan" "Jul-97" "Aug-79" "00/1979"
## [162] "Apr-79" "Jul-80" "May-81" "Mar-89" "Feb-82" "Dec-87" "Dec-90"
## [169] "Aug-88" "Oct-00" "Sep-97" "Aug-96" "1-Apr" "00/2001" "1-Sep"
## [176] "Dec-00" "1-May" "Sep-00" "Jan-00" "Feb-00" "Sep-99" "Nov-99"
## [183] "Mar-77" "Aug-91" "13-Jan" "Nov-00" "00/2013" "1-Mar" "1-Jan"
## [190] "1-Feb" "Jan-99" "Jan-78" "Feb-83" "Jan-81" "Jan-83" "1-Aug"
## [197] "Apr-91" "Mar-96" "Aug-95" "Feb-79" "Mar-00" "Sep-86" "Jul-96"
## [204] "Jul-88" "Feb-99" "May-93" "Apr-94" "Aug-86" "00/2000" "Apr-99"
## [211] "Aug-00" "Dec-82" "Jul-94" "May-00" "Jun-89" "Jun-00" "May-99"
## [218] "Aug-99" "Apr-97" "00/1998" "00/1999" "Aug-97" "May-97" "Nov-97"
## [225] "Jul-99" "Mar-97" "Mar-98" "Apr-00" "Jun-99" "Aug-98" "May-98"
## [232] "Dec-97" "Nov-98" "Jun-98" "Oct-99" "Dec-96" "00/1996" "Jul-98"
## [239] "Apr-98" "Jan-98" "Oct-98" "Mar-99" "00/1997" "Feb-98" "Sep-98"
## [246] "Dec-98" "Oct-97" "Jan-97" "Oct-96" "Jun-97" "Jun-01" "Nov-96"
## [253] "Sep-96" "Jun-96" "Feb-97" "Dec-95" "Nov-95" "May-96" "Oct-95"
## [260] "Apr-96" "Jan-96" "Sep-95" "Jun-95" "Feb-96" "Jul-95" "Apr-95"
yearstr <- gsub(".*/", "", yearstr, perl = TRUE)
unique(yearstr)
## [1] "" "12-Mar" "12-Feb" "12-May" "11-Mar" "9-Jan" "12-Jun"
## [8] "10-Dec" "12-Jul" "8-Oct" "8-Nov" "9-Jul" "9-Oct" "12-Nov"
## [15] "10-Oct" "12-Sep" "11-Jan" "10-Jan" "12-Jan" "11-Nov" "10-Aug"
## [22] "10-Feb" "12-Apr" "10-Jun" "11-Oct" "11-Jul" "10-Sep" "11-Dec"
## [29] "11-Feb" "11-Sep" "12-Oct" "2012" "11-Jun" "11-May" "9-Mar"
## [36] "4-Apr" "8-Jun" "3-Nov" "4-Aug" "6-Feb" "8-Mar" "5-Jul"
## [43] "4-Dec" "2-Dec" "5-May" "3-Apr" "5-Sep" "6-Dec" "7-Jul"
## [50] "5-Jun" "5-Oct" "4-Mar" "10-May" "9-Nov" "9-Dec" "9-Feb"
## [57] "7-Jun" "9-Jun" "9-Apr" "7-Mar" "7-Oct" "7-Feb" "6-Jun"
## [64] "5-Mar" "7-Jan" "6-Oct" "8-Sep" "8-Feb" "7-Dec" "5-Dec"
## [71] "5-Aug" "9-Sep" "3-Mar" "7-Apr" "4-Nov" "3-May" "2-Sep"
## [78] "11-Aug" "12-Aug" "10-Nov" "11-Apr" "10-Jul" "8-Jan" "10-Apr"
## [85] "7-Sep" "5-Jan" "7-Nov" "6-Aug" "9-May" "4-Jun" "2006"
## [92] "5-Nov" "9-Aug" "8-Apr" "7-Aug" "8-Jul" "10-Mar" "7-May"
## [99] "4-May" "6-Nov" "6-Jul" "3-Dec" "4-Jul" "6-May" "5-Feb"
## [106] "4-Jan" "12-Dec" "8-May" "6-Sep" "6-Mar" "6-Apr" "Jul-00"
## [113] "1-Jun" "1-Oct" "8-Dec" "2009" "6-Jan" "2007" "4-Oct"
## [120] "3-Oct" "5-Apr" "3-Aug" "3-Jun" "4-Sep" "2003" "2011"
## [127] "1-Jul" "2002" "2-Mar" "3-Feb" "2-May" "2-Apr" "2-Jun"
## [134] "1-Nov" "2-Oct" "1-Dec" "8-Aug" "2008" "Jan-93" "3-Sep"
## [141] "2-Nov" "4-Feb" "Dec-99" "Feb-89" "Aug-81" "2004" "Jul-83"
## [148] "2005" "3-Jan" "3-Jul" "2010" "2-Feb" "2-Jul" "2-Aug"
## [155] "May-85" "Jul-87" "2-Jan" "Jul-97" "Aug-79" "1979" "Apr-79"
## [162] "Jul-80" "May-81" "Mar-89" "Feb-82" "Dec-87" "Dec-90" "Aug-88"
## [169] "Oct-00" "Sep-97" "Aug-96" "1-Apr" "2001" "1-Sep" "Dec-00"
## [176] "1-May" "Sep-00" "Jan-00" "Feb-00" "Sep-99" "Nov-99" "Mar-77"
## [183] "Aug-91" "13-Jan" "Nov-00" "2013" "1-Mar" "1-Jan" "1-Feb"
## [190] "Jan-99" "Jan-78" "Feb-83" "Jan-81" "Jan-83" "1-Aug" "Apr-91"
## [197] "Mar-96" "Aug-95" "Feb-79" "Mar-00" "Sep-86" "Jul-96" "Jul-88"
## [204] "Feb-99" "May-93" "Apr-94" "Aug-86" "2000" "Apr-99" "Aug-00"
## [211] "Dec-82" "Jul-94" "May-00" "Jun-89" "Jun-00" "May-99" "Aug-99"
## [218] "Apr-97" "1998" "1999" "Aug-97" "May-97" "Nov-97" "Jul-99"
## [225] "Mar-97" "Mar-98" "Apr-00" "Jun-99" "Aug-98" "May-98" "Dec-97"
## [232] "Nov-98" "Jun-98" "Oct-99" "Dec-96" "1996" "Jul-98" "Apr-98"
## [239] "Jan-98" "Oct-98" "Mar-99" "1997" "Feb-98" "Sep-98" "Dec-98"
## [246] "Oct-97" "Jan-97" "Oct-96" "Jun-97" "Jun-01" "Nov-96" "Sep-96"
## [253] "Jun-96" "Feb-97" "Dec-95" "Nov-95" "May-96" "Oct-95" "Apr-96"
## [260] "Jan-96" "Sep-95" "Jun-95" "Feb-96" "Jul-95" "Apr-95"
yearstr <- gsub("\\D*", "", yearstr, perl = TRUE)
unique(yearstr)
## [1] "" "12" "11" "9" "10" "8" "2012" "4" "3" "6"
## [11] "5" "2" "7" "2006" "00" "1" "2009" "2007" "2003" "2011"
## [21] "2002" "2008" "93" "99" "89" "81" "2004" "83" "2005" "2010"
## [31] "85" "87" "97" "79" "1979" "80" "82" "90" "88" "96"
## [41] "2001" "77" "91" "13" "2013" "78" "95" "86" "94" "2000"
## [51] "1998" "1999" "98" "1996" "1997" "01"
changeinds <- which(as.numeric(yearstr) < 14)
changevals <- as.numeric(yearstr[changeinds]) + 2000
yearstr[changeinds] <- as.character(changevals)
unique(yearstr)
## [1] "" "2012" "2011" "2009" "2010" "2008" "2004" "2003" "2006" "2005"
## [11] "2002" "2007" "2000" "2001" "93" "99" "89" "81" "83" "85"
## [21] "87" "97" "79" "1979" "80" "82" "90" "88" "96" "77"
## [31] "91" "2013" "78" "95" "86" "94" "1998" "1999" "98" "1996"
## [41] "1997"
changeinds <- which(unlist(lapply(yearstr, nchar)) == 2)
changevals <- paste("19", yearstr[changeinds], sep = "")
yearstr[changeinds] <- changevals
df$PaperPubDateNumeric <- as.numeric(yearstr)
df$YearsSinceGrant <- df$PaperPubDateNumeric - df$GrantStartYear
Rows with no papers, papers published prior to grant years (?!) and unrefereed papers.
df.filt <- subset(df, YearsSinceGrant >= 0)
df.filt <- subset(df.filt, Refereed == "refereed")
ggplot(data = df.filt, aes(x = YearsSinceGrant)) + geom_bar()
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust
## this.
ggplot maps various aesthetics to components of the data. It then draws things based on these aesthetics, and any transformations.
ggplot(data = df.filt, aes(x = YearsSinceGrant, y = CitationCount)) + geom_point()
## Warning: Removed 60 rows containing missing values (geom_point).
library(scales)
ggplot(data = df.filt, aes(x = YearsSinceGrant, y = CitationCount)) + geom_point() +
scale_y_log10()
## Warning: Removed 60 rows containing missing values (geom_point).
d <- ggplot(df.filt, aes(x = GrantStartYear, y = YearsSinceGrant)) + xlim(1995,
2011) + ylim(0, 16)
d <- d + geom_bin2d(binwidth = c(1, 1))
d
d <- d + scale_fill_gradient(low = "paleturquoise1", high = "midnightblue",
name = "# Papers")
d
d <- d + theme_bw()
d <- d + theme(text = element_text(size = 24))
d <- d + theme(panel.grid = element_blank())
d
d <- d + geom_abline(intercept = 2012, slope = -1, size = 2, linetype = "dashed")
d
d <- d + geom_smooth(color = "red")
d
## geom_smooth: method="auto" and size of largest group is >=1000, so using
## gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the
## smoothing method.
## Warning: Removed 18 rows containing missing values (stat_smooth).
library(grid)
d <- d + theme(legend.direction = "horizontal", legend.position = c(0.75, 0.8),
legend.key.width = unit(1, "cm"), legend.title.align = 1)
d
## geom_smooth: method="auto" and size of largest group is >=1000, so using
## gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the
## smoothing method.
## Warning: Removed 18 rows containing missing values (stat_smooth).
d <- d + xlab("Grant Start Year") + ylab("Years Since Grant Start")
d <- d + ggtitle("References to NSF ASF Grants")
d
## geom_smooth: method="auto" and size of largest group is >=1000, so using
## gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the
## smoothing method.
## Warning: Removed 18 rows containing missing values (stat_smooth).
d <- d + annotate("text", x = 2002, y = 11, label = "The Future", size = 7,
angle = -45, fontface = "italic") + coord_fixed()
d
## geom_smooth: method="auto" and size of largest group is >=1000, so using
## gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the
## smoothing method.
## Warning: Removed 18 rows containing missing values (stat_smooth).
This figure is, I think, just OK. There are a lot of colors, and it can be hard to compare across colors sometimes. We also aren't comparing apples to apples when it comes to 1995's papers and 2010's papers, by virtue of there being fewer years from which to sample.
Probably the easiest way to compare is just to plot the paper production in the first year for all years.
my.subset <- subset(df.filt, YearsSinceGrant <= 1)
d <- ggplot(data = my.subset, aes(x = GrantStartYear)) + geom_histogram(binwidth = 1)
We can also use the stat_bin command with the geom option to plot this as a line instead of as bars.
d <- ggplot(data = my.subset, aes(x = GrantStartYear)) + stat_bin(binwidth = 1,
geom = "line")
d
## ymax not defined: adjusting position using y instead
We can also use the stat_bin command with the geom option to plot this as a line instead of as bars.
Clearly, the number of papers in the first year has increased over time. You might ask, is this related to the number of grants? Is this related to the total amount of money? Let's compute and plot these.
unique.grants <- unique(my.subset[, c("GrantStartYear", "GrantNumber", "AwardedUSD")])
a.money <- aggregate(unique.grants$AwardedUSD, list(unique.grants$GrantStartYear),
sum)
a.grants <- aggregate(unique.grants$GrantStartYear, list(unique.grants$GrantStartYear),
length)
a.papers <- aggregate(my.subset$PaperTitle, list(my.subset$GrantStartYear),
length)
new.df <- data.frame(year = a.money$Group.1, nGrants = a.grants$x/max(a.grants$x),
nDollars = a.money$x/max(a.money$x), nPapers = a.papers$x/max(a.papers$x))
ggplot(data = new.df, aes(x = year)) + geom_line(aes(y = nGrants, color = "Grants")) +
geom_line(aes(y = nDollars, color = "Dollars")) + geom_line(aes(y = nPapers,
color = "Papers")) + scale_colour_manual("", breaks = c("Grants", "Dollars",
"Papers"), values = c("red", "black", "blue")) + ylab("Number (relative to maximum)") +
xlab("Year") + ggtitle("Grants, Funding and Publications over Time")
Details on this and other styles are available at http://stackoverflow.com/questions/10349206/add-legend-to-ggplot2-line-plot
We can also display this information in a different way, using densities as opposed to a 2-d heat map.
ggplot(data = df.filt) + geom_density(aes(x = YearsSinceGrant, color = factor(GrantStartYear),
fill = factor(GrantStartYear)), adjust = 3, alpha = 0.2)
This is still lacking. We'd like to split this up somehow to see what's going on more clearly!
ggplot(data = df.filt) + geom_density(aes(x = YearsSinceGrant, color = factor(GrantStartYear),
fill = factor(GrantStartYear)), adjust = 3, alpha = 0.2) + facet_grid(Administration ~
.)