Quick note: Three packages I would like you to download: ggplot2, knitr, reshape2
Creating, dynamic reports is important to ensure for transparency in the research community. Thankfully, we are able to integrate both computing and reporting with the package {knitr}. In this class, the reports are generated with Sweave (Ottar’s bread and butter) . Today, we will learn and utilize Rmarkdown. Rmarkdown is the R version of Markdown which allows for easy-to-read, easy-to-write plain text format and has been lauded by data analysts for its ease of use (simple formatting, intiuitiveness, etc.)
There are many pros of using Rmarkdown:
There are some cons of using Rmarkdown
print("Hello World!")
## [1] "Hello World!"
The reason why Rmarkdown is so simple is because of the code chunks above. Rmarkdown will evaluate anything that is in a chunk. The code chunks will be evaluated as if the code was entered into the console! This is intuitive in that you don’t need to do anything special to include outputs, graphs, etc.
You can easily assign variables.
x = 2
x
## [1] 2
We can even write functions in Rmarkdown. Using previous class notes, we can put the SIR function in its own code chunk
sirmod=function(t, x, parms){
S=x[1]
I=x[2]
R=x[3]
beta=parms["beta"]
mu=parms["mu"]
gamma=parms["gamma"]
N=parms["N"]
dS = mu * (N - S) - beta * S * I / N
dI = beta * S * I / N - (mu + gamma) * I
dR = gamma * I - mu * R
res=c(dS, dI, dR)
list(res)
}
These are the times, parameters, and the starting values.
times = seq(0, 26, by=1/10)
parms = c(mu = 0, N = 1, beta = 2, gamma = 1/2)
start = c(S=0.999, I=0.001, R = 0)
We are able to see the output
out=ode(y=start,
times=times,
func=sirmod,
parms=parms)
out=as.data.frame(out)
head(round(out, 3))
## time S I R
## 1 0.0 0.999 0.001 0
## 2 0.1 0.999 0.001 0
## 3 0.2 0.999 0.001 0
## 4 0.3 0.998 0.002 0
## 5 0.4 0.998 0.002 0
## 6 0.5 0.998 0.002 0
Adding figures is quite easy in Rmarkdown. If your chunk has a code that generates a graph, Rmarkdown will insert the image! Note that not every plot needs it own code chunk. If several plot codes are in a single chunk, than Rmarkdown will simply insert the graph one after the other.
plot(x=out$time, y=out$S, ylab="fraction", xlab="time", type="l")
lines(x=out$time, y=out$I, col="red")
lines(x=out$time, y=out$R, col="green")
You also have the added flexibility of adjusting the height and width of the figure. If you include fig.width for your chunk option then you can change your figure’s width.
plot(x=out$time, y=out$S, ylab="fraction", xlab="time", type="l")
lines(x=out$time, y=out$I, col="red")
lines(x=out$time, y=out$R, col="green")
Let’s create some ridiculously fake data to show how important Rmarkdown is for statistical analyses.
test1=data.frame(ind = seq(1,10), dep = seq(1,10))
plot(test1,type='o')
test1lm <- lm(dep~ind,data=test1)
summary(test1lm)
## Warning in summary.lm(test1lm): essentially perfect fit: summary may be
## unreliable
##
## Call:
## lm(formula = dep ~ ind, data = test1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.661e-16 -1.157e-16 4.273e-17 2.153e-16 4.167e-16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.123e-15 2.458e-16 4.571e+00 0.00182 **
## ind 1.000e+00 3.961e-17 2.525e+16 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.598e-16 on 8 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 6.374e+32 on 1 and 8 DF, p-value: < 2.2e-16
If you learn a bit of LaTex (Learning how to write equations takes about <30 minutes), you can also write in mathematical equations directly unto the document.
Here is the SIR equations:
\[\frac{dS}{dt} = \mu(N-S) - \frac{\beta IS}{N}\] \[\frac{dI}{dt} = \frac{\beta IS}{N} - (\mu+\gamma)I\] \[\frac{dR}{dt} = \gamma I - \mu R\]
You can also add in pictures (From your computer or online) and gifs are possible as well. (Though note only in HTML)
In the ggplot2 section, I will show how to make tables both manually and automatically
By learning a bit of basic HTML, LaTex, and RMarkdown, you can truly make a dynamic report.
The aim of the grammar is to “bring together in a coherent way things that previously appeared unrelated and which also will provide a basis for dealing systematically with new situations” (Cox 1978). How well have we succeeded? (Wickham, 2012)
The emphasis in ggplot2 is reducing the amount of thinking time by making it easier to go from the plot in your brain to the plot on the page." (Wickham, 2012)
Base graphics are good for drawing pictures; ggplot2 graphics are good for understanding the data." (Wickham, 2012)
Both a package and a phenomenon, ggplot was created by Hadley Wickham (2005). This package allows for flexible data visualization by implementing the grammar of graphics (Leland Wilkonson) . While known for its difficulty curve in the beginning, once you know the basic grammar (and with the help of Google) you can truly make some beautiful graphs.
Let’s use the data that is included in ggplot2 {diamonds}
| Variable | Description |
|---|---|
| Price | Price in US dollars ($326-$18,823) |
| Carat | Weight of the diamond (0.2-5.01) |
| Cut | Quality of the cut (Fair, Good, Very Good, Premium, Ideal) |
| Colour | Diamond colour, from J (worst) to D (best) |
| Clarity | How clear the diamond is (I1 (worst), SI1, SI2, VS1, VS2, VVS1, VVS2, IF (best)) |
| x | Length in mm (0-10.74) |
| y | Width in mm (0-58.9) |
| z | Wepth in mm (0-31.8) |
| depth | total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79) |
| table | width of top of diamond relative to widest point (43-95) |
Let’s look at the first few entries
kable(head(diamonds))
| carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|
| 0.23 | Ideal | E | SI2 | 61.5 | 55 | 326 | 3.95 | 3.98 | 2.43 |
| 0.21 | Premium | E | SI1 | 59.8 | 61 | 326 | 3.89 | 3.84 | 2.31 |
| 0.23 | Good | E | VS1 | 56.9 | 65 | 327 | 4.05 | 4.07 | 2.31 |
| 0.29 | Premium | I | VS2 | 62.4 | 58 | 334 | 4.20 | 4.23 | 2.63 |
| 0.31 | Good | J | SI2 | 63.3 | 58 | 335 | 4.34 | 4.35 | 2.75 |
| 0.24 | Very Good | J | VVS2 | 62.8 | 57 | 336 | 3.94 | 3.96 | 2.48 |
Let’s put our ggplot2 into action. Let’s see the relationship between the price and carat of the diamond. Here is the simple ggplot code:
ggplot(diamonds, aes(x=carat,y=price))+geom_point()
Borrowing from another source ( Thomas Hopper), a ggplot2 graph is built from a few basic elements
The first part of the ggplot code is the data. This will be the foundation for the rest of the code. We are saying that we will be using the dataset {diamonds}, and aes (stands for aesthetics) means that we are putting our x (carat) in the x-axis and y (price) into the y-axis. Now, we add the geometric layer which will be the points (geom_point).
The best way to think about ggplot2 is the act of adding layers upon layers.
Now, let’s make it more complicated.
I’m curious if there is a pattern with cut as well. Now I’m going to say to ggplot: please assign different colors to different types of cuts, colors (of diamond), and clarity.
ggplot(diamonds,aes(x=carat,y=price,color=cut))+geom_point()
ggplot(diamonds,aes(x=carat,y=price,color=color))+geom_point()
ggplot(diamonds,aes(x= carat,y=price,color = clarity))+geom_point()
Interestingly enough, when you make these plots, you can store the plot objects into variables. Thus I am able to add another layer to the variable, I want to add a theme layer that changes the background color and add some lines.
cut.caratprice <- ggplot(diamonds,aes(x=carat,y=price,color=cut))+geom_point()
color.caratprice<-ggplot(diamonds,aes(x=carat,y=price,color=color))+geom_point()
clarity.caratprice<-ggplot(diamonds,aes(x= carat,y=price,color = clarity))+geom_point()
cut.caratprice+theme_bw()
color.caratprice + geom_line()+theme_bw()
Now I’m interested in seeing if there is a relationship between carat and depth, I want the cut of the diamond to be color-coded and the color of the diamond to be related to the size of the points. This shows how versatile ggplot2 can be.
ggplot(diamonds,aes(x=carat,y=depth,color=cut,size=color))+geom_point()+theme_bw()
## Warning: Using size for a discrete variable is not advised.
One highlight of ggplot is that you can skilfully change the details of your graph. To showcase this, let’s look at the United States economic time series.
data(economics)
kable(head(economics))
| date | pce | pop | psavert | uempmed | unemploy |
|---|---|---|---|---|---|
| 1967-07-01 | 507.4 | 198712 | 12.5 | 4.5 | 2944 |
| 1967-08-01 | 510.5 | 198911 | 12.5 | 4.7 | 2945 |
| 1967-09-01 | 516.3 | 199113 | 11.7 | 4.6 | 2958 |
| 1967-10-01 | 512.9 | 199311 | 12.5 | 4.9 | 3143 |
| 1967-11-01 | 518.1 | 199498 | 12.5 | 4.7 | 3066 |
| 1967-12-01 | 525.8 | 199657 | 12.1 | 4.8 | 3018 |
Creating a time series to show how the number of unemployment changes over the years.
ggplot(economics, aes(x= date, y= unemploy))+geom_point()+geom_line()
Hm… I’m not crazy about the grey background, I want to change the color and size of the points, and the linetype. So I am able to change aspects of the graph by changing the layers associated with it. The graph is not aesthetically pleasing (Most journals will probably reject it) but I think by making this unattractive graph I am able to show the control we can have!
ggplot(economics,aes(x=date,y=unemploy))+ geom_point(size = 1.5,colour='orange')+geom_line(size=1,alpha=0.7,linetype=2)+theme_bw()+labs(x="Date",y="Unemployment (In thousands")+ggtitle("Unemployment in the USA")+
geom_vline(xintercept= as.numeric(as.Date("2007-01-01")),color='red')
We can also do smoothing with different methods (loess is the default with n < 1000)
ggplot(economics, aes(x=date,y=unemploy))+geom_point()+
geom_line()+stat_smooth(fill="yellow")+theme_bw()
Here is a great infographic on how to change the details of the graph: Cheatsheet
Let’s look at this car dataset {mpg} that is included in ggplot2 as well.
kable(head(mpg))
| manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
|---|---|---|---|---|---|---|---|---|---|---|
| audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact |
| audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact |
| audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |
| audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact |
| audi | a4 | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact |
| audi | a4 | 2.8 | 1999 | 6 | manual(m5) | f | 18 | 26 | p | compact |
I’m interested in seeing how many counts of different car classes there are.
The stat is another part of the grammar and are related to summaries (quantile, sums, so on). In this code, we make the height of the bar equal to the number of cases in each group.
ggplot(mpg, aes(x= factor(class)))+geom_bar(stat='count',fill='navy')+theme_bw()
ggplot(mpg, aes(x= factor(manufacturer)))+geom_bar(stat='count',fill='gold')+theme_bw()+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
From our two graphs we can see that in the data, SUVs make up the bulk of the data and Dodge is the most represented.
Boxplots are also fairly simple to do in ggplot.
ggplot(mpg, aes(x=factor(class),y= hwy))+geom_boxplot(aes(fill= class)) +theme_bw()
ggplot(mpg, aes(x= factor(manufacturer),y=hwy))+geom_boxplot(aes(fill=manufacturer))+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Faceting is also a useful part of ggplot2. Faceting is where you subset up your data by one or more variables than plot it together
ggplot(mtcars, aes(x=mpg,y=wt ))+geom_point(size=3)+geom_line()+theme_bw()+facet_grid(.~am)
You can facet with more than one variable as well
ggplot(mtcars,aes(x= mpg, y= wt))+geom_point(size=3)+geom_line()+theme_bw()+facet_grid(am~cyl)
The best way to pra and try to recreate them with ggplot. For simple graphs, you may be thinking that using the basic plot function is easier but there is ggplot’s qplot (quick plot) that has the same aesthetics as the full ggplot.
Here is the scatter plot:
qplot(mpg,wt,data=mtcars)
Here is the line plot:
qplot (mpg,wt,data=mtcars,geom='line')
But let’s try converting the parasitoid/host plot into a ggplot. One caveat about ggplot2 is that data needs to be in long format. We can fix this problem by using reshape2 and melting our data (Another package by Wilkham)
burnett = read.csv("burnettparasitoid.csv")
burn=melt(burnett,id.vars= "Generation")
We are only looking at parasitoids and hosts so let’s subset.
burn2 = subset(burn,variable == c("NumberofHostsParasitized", "NumberofHostsUnparasitized"))
plot(burnett$Generation, burnett$NumberofHostsParasitized, type="b", ylab="numbers", xlab="generation")
lines(burnett$Generation, burnett$NumberofHostsUnparasitized, type="b", col=2, pch=2)
legend("topleft",
legend=c("Parasitoid", "Host"), lty=c(1,1), pch=c(1,2), col=c(1,2))
ggplot(burn2,aes(x=Generation,y =value, color=variable)) + geom_point(aes(shape=factor(variable)),size =3)+scale_shape(solid=FALSE) +geom_line()+theme_bw()+labs(y='Numbers')+
scale_color_manual(values=c("red","black"))
Not an exact copy, but this shows the difference between basic plot and ggplot2!