Quick note: Three packages I would like you to download: ggplot2, knitr, reshape2

Why RMarkdown?

Creating, dynamic reports is important to ensure for transparency in the research community. Thankfully, we are able to integrate both computing and reporting with the package {knitr}. In this class, the reports are generated with Sweave (Ottar’s bread and butter) . Today, we will learn and utilize Rmarkdown. Rmarkdown is the R version of Markdown which allows for easy-to-read, easy-to-write plain text format and has been lauded by data analysts for its ease of use (simple formatting, intiuitiveness, etc.)

There are many pros of using Rmarkdown:

  1. HTML output means it is useful for blogs, websites, or anything internet related.
  2. Rmarkdown can also be converted into PDF files, Word Documents, and Powerpoint (Ioslides)
  3. Interactive documents: Your readers can interact with your graphs through Shiny.
  4. Plentiful documentation: tutorials, guides, and examples.
  5. Perfect for collaboration (version control), homework, reports
  6. Integrated with Rstudio
  7. Free web publishing through RPubs.

There are some cons of using Rmarkdown

  1. Restricted references compared to Sweave
  2. Unable to create complicated tables
  3. Less control than Sweave

The Basics of Rmarkdown

Code Chunks

print("Hello World!")
## [1] "Hello World!"

The reason why Rmarkdown is so simple is because of the code chunks above. Rmarkdown will evaluate anything that is in a chunk. The code chunks will be evaluated as if the code was entered into the console! This is intuitive in that you don’t need to do anything special to include outputs, graphs, etc.

You can easily assign variables.

x = 2
x
## [1] 2

Functions

We can even write functions in Rmarkdown. Using previous class notes, we can put the SIR function in its own code chunk

 sirmod=function(t, x, parms){
   S=x[1]
   I=x[2]
   R=x[3]

  beta=parms["beta"]
  mu=parms["mu"]
  gamma=parms["gamma"]
  N=parms["N"]

  dS = mu * (N - S) - beta * S * I / N
  dI = beta * S * I / N - (mu + gamma) * I
  dR = gamma * I - mu * R
  res=c(dS, dI, dR)
  list(res)
 }

These are the times, parameters, and the starting values.

times = seq(0, 26, by=1/10)
parms = c(mu = 0, N = 1, beta = 2, gamma = 1/2)
start = c(S=0.999, I=0.001, R = 0)

We are able to see the output

out=ode(y=start,
times=times,
func=sirmod,
parms=parms)
out=as.data.frame(out)
head(round(out, 3))
##   time     S     I R
## 1  0.0 0.999 0.001 0
## 2  0.1 0.999 0.001 0
## 3  0.2 0.999 0.001 0
## 4  0.3 0.998 0.002 0
## 5  0.4 0.998 0.002 0
## 6  0.5 0.998 0.002 0

Figures

Adding figures is quite easy in Rmarkdown. If your chunk has a code that generates a graph, Rmarkdown will insert the image! Note that not every plot needs it own code chunk. If several plot codes are in a single chunk, than Rmarkdown will simply insert the graph one after the other.

plot(x=out$time, y=out$S, ylab="fraction", xlab="time", type="l")
lines(x=out$time, y=out$I, col="red")
lines(x=out$time, y=out$R, col="green")

You also have the added flexibility of adjusting the height and width of the figure. If you include fig.width for your chunk option then you can change your figure’s width.

plot(x=out$time, y=out$S, ylab="fraction", xlab="time", type="l")
lines(x=out$time, y=out$I, col="red")
lines(x=out$time, y=out$R, col="green")

Data Analysis

Let’s create some ridiculously fake data to show how important Rmarkdown is for statistical analyses.

test1=data.frame(ind = seq(1,10), dep = seq(1,10))
plot(test1,type='o')

test1lm <- lm(dep~ind,data=test1)
summary(test1lm)
## Warning in summary.lm(test1lm): essentially perfect fit: summary may be
## unreliable
## 
## Call:
## lm(formula = dep ~ ind, data = test1)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -5.661e-16 -1.157e-16  4.273e-17  2.153e-16  4.167e-16 
## 
## Coefficients:
##              Estimate Std. Error   t value Pr(>|t|)    
## (Intercept) 1.123e-15  2.458e-16 4.571e+00  0.00182 ** 
## ind         1.000e+00  3.961e-17 2.525e+16  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.598e-16 on 8 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 6.374e+32 on 1 and 8 DF,  p-value: < 2.2e-16

Equations

If you learn a bit of LaTex (Learning how to write equations takes about <30 minutes), you can also write in mathematical equations directly unto the document.

Here is the SIR equations:

\[\frac{dS}{dt} = \mu(N-S) - \frac{\beta IS}{N}\] \[\frac{dI}{dt} = \frac{\beta IS}{N} - (\mu+\gamma)I\] \[\frac{dR}{dt} = \gamma I - \mu R\]


Miscellaneous

You can also add in pictures (From your computer or online) and gifs are possible as well. (Though note only in HTML)

In the ggplot2 section, I will show how to make tables both manually and automatically

Conclusion

By learning a bit of basic HTML, LaTex, and RMarkdown, you can truly make a dynamic report.


ggplot2

What is ggplot?

The aim of the grammar is to “bring together in a coherent way things that previously appeared unrelated and which also will provide a basis for dealing systematically with new situations” (Cox 1978). How well have we succeeded? (Wickham, 2012)

The emphasis in ggplot2 is reducing the amount of thinking time by making it easier to go from the plot in your brain to the plot on the page." (Wickham, 2012)

Base graphics are good for drawing pictures; ggplot2 graphics are good for understanding the data." (Wickham, 2012)

Both a package and a phenomenon, ggplot was created by Hadley Wickham (2005). This package allows for flexible data visualization by implementing the grammar of graphics (Leland Wilkonson) . While known for its difficulty curve in the beginning, once you know the basic grammar (and with the help of Google) you can truly make some beautiful graphs.

Why should you use ggplot?

  1. Flexible
  2. High level of control
  3. Beautiful graphs
  4. Most importantly, documentation (books, websites, infographics, etc.) (So much documentation that the addition of my introduction/guide is superfluous )

Grammar of graphic

Let’s use the data that is included in ggplot2 {diamonds}

Variable Description
Price Price in US dollars ($326-$18,823)
Carat Weight of the diamond (0.2-5.01)
Cut Quality of the cut (Fair, Good, Very Good, Premium, Ideal)
Colour Diamond colour, from J (worst) to D (best)
Clarity How clear the diamond is (I1 (worst), SI1, SI2, VS1, VS2, VVS1, VVS2, IF (best))
x Length in mm (0-10.74)
y Width in mm (0-58.9)
z Wepth in mm (0-31.8)
depth total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79)
table width of top of diamond relative to widest point (43-95)

Let’s look at the first few entries

kable(head(diamonds))
carat cut color clarity depth table price x y z
0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48

Let’s put our ggplot2 into action. Let’s see the relationship between the price and carat of the diamond. Here is the simple ggplot code:

ggplot(diamonds, aes(x=carat,y=price))+geom_point()

Borrowing from another source ( Thomas Hopper), a ggplot2 graph is built from a few basic elements

  1. Data, which is the raw data that you want to plot
  2. Geometrics (geom_), which is the geometric shapes that will represent the data
  3. Aethetics (aes()), aesthetics of the geometric and statistical objects, such as color,size,s hape and position
  4. Scales (scale_), maps between the data and the aesthetic dimensions, such as data range to plot width or factor values to colors

The first part of the ggplot code is the data. This will be the foundation for the rest of the code. We are saying that we will be using the dataset {diamonds}, and aes (stands for aesthetics) means that we are putting our x (carat) in the x-axis and y (price) into the y-axis. Now, we add the geometric layer which will be the points (geom_point).

The best way to think about ggplot2 is the act of adding layers upon layers.

Now, let’s make it more complicated.

I’m curious if there is a pattern with cut as well. Now I’m going to say to ggplot: please assign different colors to different types of cuts, colors (of diamond), and clarity.

ggplot(diamonds,aes(x=carat,y=price,color=cut))+geom_point()

ggplot(diamonds,aes(x=carat,y=price,color=color))+geom_point()

ggplot(diamonds,aes(x= carat,y=price,color = clarity))+geom_point()

Interestingly enough, when you make these plots, you can store the plot objects into variables. Thus I am able to add another layer to the variable, I want to add a theme layer that changes the background color and add some lines.

cut.caratprice <- ggplot(diamonds,aes(x=carat,y=price,color=cut))+geom_point()
color.caratprice<-ggplot(diamonds,aes(x=carat,y=price,color=color))+geom_point()
clarity.caratprice<-ggplot(diamonds,aes(x= carat,y=price,color = clarity))+geom_point()
cut.caratprice+theme_bw()

color.caratprice + geom_line()+theme_bw()

Now I’m interested in seeing if there is a relationship between carat and depth, I want the cut of the diamond to be color-coded and the color of the diamond to be related to the size of the points. This shows how versatile ggplot2 can be.

ggplot(diamonds,aes(x=carat,y=depth,color=cut,size=color))+geom_point()+theme_bw()
## Warning: Using size for a discrete variable is not advised.


Theme

One highlight of ggplot is that you can skilfully change the details of your graph. To showcase this, let’s look at the United States economic time series.

data(economics)
kable(head(economics))
date pce pop psavert uempmed unemploy
1967-07-01 507.4 198712 12.5 4.5 2944
1967-08-01 510.5 198911 12.5 4.7 2945
1967-09-01 516.3 199113 11.7 4.6 2958
1967-10-01 512.9 199311 12.5 4.9 3143
1967-11-01 518.1 199498 12.5 4.7 3066
1967-12-01 525.8 199657 12.1 4.8 3018

Variables

  1. date: month of data collection
  2. psavert: Personal saving rate
  3. personal: comsumption expenditures, in billions of dollars
  4. unemploy: number of emplyoed in thousands
  5. unempmed: median duration of unemployment, in week
  6. pop: total population, in thousands

Creating a time series to show how the number of unemployment changes over the years.

ggplot(economics, aes(x= date, y= unemploy))+geom_point()+geom_line()

Hm… I’m not crazy about the grey background, I want to change the color and size of the points, and the linetype. So I am able to change aspects of the graph by changing the layers associated with it. The graph is not aesthetically pleasing (Most journals will probably reject it) but I think by making this unattractive graph I am able to show the control we can have!

ggplot(economics,aes(x=date,y=unemploy))+ geom_point(size = 1.5,colour='orange')+geom_line(size=1,alpha=0.7,linetype=2)+theme_bw()+labs(x="Date",y="Unemployment (In thousands")+ggtitle("Unemployment in the USA")+
  geom_vline(xintercept= as.numeric(as.Date("2007-01-01")),color='red')

We can also do smoothing with different methods (loess is the default with n < 1000)

ggplot(economics, aes(x=date,y=unemploy))+geom_point()+
  geom_line()+stat_smooth(fill="yellow")+theme_bw()

Here is a great infographic on how to change the details of the graph: Cheatsheet

Cookbook


Other graphs

Let’s look at this car dataset {mpg} that is included in ggplot2 as well.

kable(head(mpg))
manufacturer model displ year cyl trans drv cty hwy fl class
audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
audi a4 2.0 2008 4 manual(m6) f 20 31 p compact
audi a4 2.0 2008 4 auto(av) f 21 30 p compact
audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
audi a4 2.8 1999 6 manual(m5) f 18 26 p compact

I’m interested in seeing how many counts of different car classes there are.

The stat is another part of the grammar and are related to summaries (quantile, sums, so on). In this code, we make the height of the bar equal to the number of cases in each group.

ggplot(mpg, aes(x= factor(class)))+geom_bar(stat='count',fill='navy')+theme_bw()

ggplot(mpg, aes(x= factor(manufacturer)))+geom_bar(stat='count',fill='gold')+theme_bw()+
theme(axis.text.x = element_text(angle = 90, hjust = 1))

From our two graphs we can see that in the data, SUVs make up the bulk of the data and Dodge is the most represented.


Boxplots are also fairly simple to do in ggplot.

ggplot(mpg, aes(x=factor(class),y= hwy))+geom_boxplot(aes(fill= class)) +theme_bw()

ggplot(mpg, aes(x= factor(manufacturer),y=hwy))+geom_boxplot(aes(fill=manufacturer))+
theme(axis.text.x = element_text(angle = 90, hjust = 1))


Faceting is also a useful part of ggplot2. Faceting is where you subset up your data by one or more variables than plot it together

ggplot(mtcars, aes(x=mpg,y=wt ))+geom_point(size=3)+geom_line()+theme_bw()+facet_grid(.~am)

You can facet with more than one variable as well

ggplot(mtcars,aes(x= mpg, y= wt))+geom_point(size=3)+geom_line()+theme_bw()+facet_grid(am~cyl)


qplot and conclusion

The best way to pra and try to recreate them with ggplot. For simple graphs, you may be thinking that using the basic plot function is easier but there is ggplot’s qplot (quick plot) that has the same aesthetics as the full ggplot.

Here is the scatter plot:

qplot(mpg,wt,data=mtcars)

Here is the line plot:

qplot (mpg,wt,data=mtcars,geom='line')

But let’s try converting the parasitoid/host plot into a ggplot. One caveat about ggplot2 is that data needs to be in long format. We can fix this problem by using reshape2 and melting our data (Another package by Wilkham)

burnett = read.csv("burnettparasitoid.csv")
burn=melt(burnett,id.vars= "Generation")

We are only looking at parasitoids and hosts so let’s subset.

burn2 = subset(burn,variable == c("NumberofHostsParasitized",                           "NumberofHostsUnparasitized"))
plot(burnett$Generation, burnett$NumberofHostsParasitized, type="b", ylab="numbers", xlab="generation")
 lines(burnett$Generation, burnett$NumberofHostsUnparasitized, type="b", col=2, pch=2)
legend("topleft",
legend=c("Parasitoid", "Host"), lty=c(1,1), pch=c(1,2), col=c(1,2))

ggplot(burn2,aes(x=Generation,y =value, color=variable)) + geom_point(aes(shape=factor(variable)),size =3)+scale_shape(solid=FALSE) +geom_line()+theme_bw()+labs(y='Numbers')+
scale_color_manual(values=c("red","black")) 

Not an exact copy, but this shows the difference between basic plot and ggplot2!