You can see a rendered (HTML) version of this document on RPubs.

Rrrresources

FMI on the nitty gritty of R and general tips/tricks:

A humorous (and insightful) take comes from seasoned Python programmers who feel very Arrrgh about R

Let us know if you’d like more resources!

Issues with installation

Installing tabulizer can be difficult because of underlying problems with rJava that are frustratingly opaque and hard to diagnose. Here are my notes on how I managed to get rJava, and subsequently tabulizer, to install properly.

  1. It looks like tabulizer fails to install because rJava points at the wrong Java distribution and/or the wrong Java version.
    1. Java needs to be version 1.7 or 1.8 and up, but something on either the tabulizer or the tabula end (tabula is the Java library that tabulizer imports into R) seems to make rJava default to v 1.6.x if that version is on the machine.
  2. Open terminal and type the following:
    1. MacBook-Pro-5:~ Char$ export JAVA_HOME=`/usr/libexec/java_home -v 1.8`
    2. MacBook-Pro-5:~ Char$ export LD_LIBRARY_PATH=$JAVA_HOME/jre/lib/server
    3. MacBook-Pro-5:~ Char$ env
  3. From the env command, you should now see:
    1. PWD=/Users/Char
    2. JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_151.jdk/Contents/Home
    3. LANG=en_US.UTF-8
  4. In the same terminal session, call open -a RStudio so that RStudio inherits the environment variables set above
  5. In RStudio, call:
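# install.packages("rJava") # uncomment if rJava is not installed yet
library(rJava)
.jinit() # start the Java virtual machine
.jcall("java/lang/System", "S", "getProperty", "java.runtime.version") # should report the 1.8 runtime selected above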
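To make the version check in the last step less error-prone, the string returned by .jcall() can be tested programmatically. The following is a minimal sketch, not part of rJava: java_major() and java_version_ok() are hypothetical helper names, and the parsing assumes version strings shaped like "1.8.0_151-b12" (older scheme) or "11.0.2+9" (newer scheme).

```r
# Hypothetical helpers (not part of rJava): parse a java.runtime.version
# string and report whether it is Java 7 (i.e. 1.7) or newer.
java_major <- function(version_string) {
  # Split on ".", "_", "+" and "-" so "1.8.0_151-b12" becomes "1" "8" "0" "151" "b12"
  parts <- strsplit(version_string, "[._+-]")[[1]]
  nums <- suppressWarnings(as.integer(parts))
  # Old scheme: "1.x" means Java x; new scheme: the first number is the major version
  if (!is.na(nums[1]) && nums[1] == 1L) nums[2] else nums[1]
}

java_version_ok <- function(version_string, minimum = 7L) {
  major <- java_major(version_string)
  !is.na(major) && major >= minimum
}

java_version_ok("1.8.0_151-b12")    # a 1.8 runtime passes
java_version_ok("1.6.0_65-b14-468") # a 1.6 runtime does not
```

In a live session the argument would be the value returned by the .jcall() line above. Sys.getenv("JAVA_HOME") can also be called from within R to confirm that the exported path was picked up.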

Examples of using tabulizer and digitize

Importing an example figure

Below, we will use the googledrive package to import the file Galetti1A.png from our team drive. We will then call digitize to extract the data from the bar chart.

###===========================================================================
### Importing and downloading figure from Google Drive Team Folder
###===========================================================================
library(googledrive) # provides drive_get() and drive_download()
# Location of the Galetti1A figure
galetti1A_loc <- drive_get(id="10CXIOQAIIoC8Z9K_xjx97PwB63XEU-YR", team_drive = "CoharvestSRE")
# Download the figure to your local folder. You can skip these two steps and download by point and click instead, or work from a copy already on your machine.
galetti1A_fig <- drive_download(file=galetti1A_loc, path="~/GoogleDrive/Rscripts/SRE_coharvest/Galetti1A.png", overwrite=TRUE)

###===========================================================================
### Digitizing plot
###===========================================================================

galetti1A <- digitize::digitize(image_filename=galetti1A_fig$local_path) # Note that you could instead directly specify a local path and bypass the two steps above. Then it would be:
# galetti1A <- digitize::digitize(image_filename="~/GoogleDrive/Rscripts/SRE_coharvest/Galetti1A.png")

# save(galetti1A, file="~/GoogleDrive/Rscripts/SRE_coharvest/galetti1Adigitize.RData") # save the output to a serialized file

# Store the plot to file. This differs from the usual routine because the digitize package needs unencumbered access to the "Plots" tab in RStudio (not sure how it works in base R).
dev.copy(png,"~/Downloads/galetti1A-digitized.png", width=4, height=3, units="in", res=120, bg="transparent")
dev.off()

We can inspect the output by displaying it here:

galetti1A


###===========================================================================
### Extracting data
###===========================================================================
attach("~/GoogleDrive/Rscripts/SRE_coharvest/galetti1Adigitize.RData")
galetti1A # we don't care about the X values; these are just dummy values.
##            x         y
## 1  1.0000000 54.855196
## 2  0.9905914 76.490630
## 3  1.9973118 36.286201
## 4  1.9784946 86.541738
## 5  3.0134409 26.746167
## 6  2.9946237 93.015332
## 7  4.0013441  2.385009
## 8  4.0201613 50.085179
## 9  5.0080645 15.502555
## 10 4.9986559 23.339012
## 11 5.9771505 69.846678
## 12 5.9865591 71.890971
## 13 6.9838710  5.110733
## 14 6.9650538  9.199319
## 15 7.9905914 10.221465
## 16 7.9811828 11.413969
### 1: Pulling the y-column out
galetti1A_data <- galetti1A$y # we only care about the y-axis as this is a barplot with categories on the X

### 2: Pushing these values into a data.frame
galetti1A_df <- data.frame(Intact=galetti1A_data[c(TRUE,FALSE)],
                           Rodents=galetti1A_data[c(FALSE,TRUE)]-galetti1A_data[c(TRUE,FALSE)],
                           Insects=100-galetti1A_data[c(FALSE,TRUE)])
rownames(galetti1A_df) <- c("JI","Pi","Ca","PA","UN","XJ","SH","Af")

### 3: Displaying the table in a nice format (note that you need to run install.packages("htmlTable"))
htmlTable::htmlTable( round(galetti1A_df,1), # round to 1 decimal point, also see ?signif
                      rowlabel="Sites")
Sites  Intact  Rodents  Insects
JI     54.9    21.6     23.5
Pi     36.3    50.3     13.5
Ca     26.7    66.3     7
PA     2.4     47.7     49.9
UN     15.5    7.8      76.7
XJ     69.8    2        28.1
SH     5.1     4.1      90.8
Af     10.2    1.2      88.6
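The alternating index c(TRUE, FALSE) used in step 2 works because R recycles a short logical index along the whole vector, so it picks every odd element and c(FALSE, TRUE) picks every even one. The following is a minimal sketch of that idea on made-up numbers (not values from the Galetti figure), assuming the digitized points alternate between the lower and the upper boundary of each stacked bar:

```r
# Made-up cumulative heights for two stacked bars:
# lower boundary, upper boundary, lower boundary, upper boundary
tops <- c(20, 50, 10, 90)

lower <- tops[c(TRUE, FALSE)]  # elements 1 and 3: lower boundary of each bar
upper <- tops[c(FALSE, TRUE)]  # elements 2 and 4: upper boundary of each bar

# Convert cumulative heights into the three stacked segments of each bar
toy_df <- data.frame(Bottom = lower,
                     Middle = upper - lower,
                     Top    = 100 - upper)

rowSums(toy_df)  # the three segments of each bar add up to 100
```

Checking that every row sums to 100 is a quick way to catch points that were clicked in the wrong order during digitizing.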

My guess is that what is ultimately useful from this type of data is:

Tabulizer example: Peres et al., Brazil nut harvest

In Table S1, Carlos Peres and co-authors present information on the density and population structure of Brazil nut populations.

Steps:

###===========================================================================
### Importing table from PDF
###===========================================================================

peresPDF <- drive_find(pattern="Peres-BrazilNut-Science", team_drive="CoharvestSRE")
peresPDFurl <- drive_link(peresPDF)
# peresS1 <- tabulizer::extract_tables(file=peresPDFurl, pages=10) # uffda, sadly does not quite work. Will have to spend some time troubleshooting
peresS1 <- tabulizer::extract_tables(file="/Volumes/GoogleDrive/Team Drives/CoharvestSRE/Readings/Parameters/Peres-BrazilNut-Science.full.pdf", pages=10)
 
save(peresS1, file="~/GoogleDrive/Rscripts/SRE_coharvest/peresS1.RData")
attach("~/GoogleDrive/Rscripts/SRE_coharvest/peresS1.RData")
###===========================================================================
### Cleaning table
###===========================================================================
head(peresS1[[1]]) # wow it looks AWFUL!
##      [,1]                                                                                                                      
## [1,] "Table S1.  Brazil nut tree (Bertholletia excelsa) populations examined in this study.  Site numbers refer to those show" 
## [2,] "corresponds to the aggregate area of forest plots and/or transect strip-width censused at each site.  NA = data not avai"
## [3,] ""                                                                                                                        
## [4,] ""                                                                                                                        
## [5,] "Site Latitude, Area sampled Density Mean ± SE s* a % No. of tree"                                                        
## [6,] "No. Site locality State, Country Longitude (ha) (ind. ha–1) DBH (cm) Juveniles sampled"
peresS1list <- peresS1[[1]][7:38,]
peresS1list <- lapply(peresS1list, function(x) {unlist(strsplit(x," "))})

peres_entries <- unlist(lapply(peresS1list, length)) # the table import was wonky so it is incorrectly splitting along newlines
peres_rows <- which(peres_entries > 3)

peresS1df <- NULL # accumulate the rows extracted from the tabulizer object in the loop below
for (i in seq_along(peres_rows)) {
  peresS1df <- rbind(peresS1df, tail(peresS1list[[peres_rows[i]]], 10)) # keep the last 10 tokens, which are the data columns
}

peresS1df <- data.frame(peresS1df, stringsAsFactors = FALSE) # convert output from matrix to data.frame format
names(peresS1df) <- c("Latitude","Longitude","AreaHectares","Density-treeHA-1","MeanDBH","pm","SE_DBH","CumS*","PercentJuve","NoTreeSampled")
peresS1df[c(3:5,7:10)] <- apply(peresS1df[c(3:5,7:10)], 2, as.numeric) # convert several columns to numeric class
## Warning in apply(peresS1df[c(3:5, 7:10)], 2, as.numeric): NAs introduced by
## coercion
peresS1df$HuntIntensity <- c("U","U","P","L","L","P","P","L","M","L","M",rep("U",2),rep("L",6),rep("M",2),"U","M") # argh, this column didn't import so we get to hand impute...lovely

###===========================================================================
### Displaying table
###===========================================================================
htmlTable::htmlTable(peresS1df, rnames=c("Pinkaiti","Kranure","Saraca","Maraba","Tapajos","Alto Cajar","Iratapuru","Aventura","Ussicanta","Lago Ciputuba","Amana","Rio Cristalino","Claudia","Nova Esperanca","Colocacao Tucuma","Colocacao Rio","Encontro","Oculto","Limon","El Tigre","El Sena","Alter do Chao","Rio Ouro Preto"), rowlabel="Sites")
Sites             Latitude  Longitude  AreaHectares  Density-treeHA-1  MeanDBH  pm  SE_DBH  CumS*   PercentJuve  NoTreeSampled  HuntIntensity
Pinkaiti          7°46’S,   51°57’W    60            3.3               72.6     ±   3.8     1.886   43.3         224            U
Kranure           7°49’S,   51°55’W    16            3.4               65.7     ±   7.6     1.92    52.5         40             U
Saraca            1°45’S,   56°30’W    769           1.5               134.8    ±   1.4     -1.1    1.6          1165           P
Maraba            5°12’S,   49°06’W    9             4.3               119.4    ±   11.7    1.572   33.3         39             L
Tapajos           3°55’S,   55°28’W    100           0.7               73.8     ±   2.9     1.438   35.7         269            L
Alto Cajar        0°20’S,   51°43’W    23            12                156.4    ±   2.9     -1.85   0.7          272            P
Iratapuru         0°03’N,   52°30’W    25            9.4               154.1    ±   3       -1.779  0.9          230            P
Aventura          4°19’S,   62°30’W    25.8          6.8               102.3    ±   4.3     0.315   24.6         122            L
Ussicanta         4°21’S,   62°35’W    15.4          8.5               133.5    ±   3.04    -2.582  3.8          132            M
Lago Ciputuba     5°48’S,   60°13’W    114           1.8               116.6    ±   3.91    0.348   22.4         201            L
Amana             2°21’S,   64°45’W    50            1.4               123.5    ±   4.1     -1.891  5.3          76             M
Rio Cristalino    9°28’S,   55°55’W    64            4.9               90.7     ±   6.1     1.691   40.7         113            U
Claudia           11°30’S,  54°48’W    35            3.6               71       ±   4.5             31.2         125            U
Nova Esperanca    10°40’S,  68°30’W    51            3.1               73.9     ±   4.3     1.809   47.2         161            L
Colocacao Tucuma  10°51’S,  68°44’W    184.5         1.4               108      ±   2.4     -0.783  12.2         255            L
Colocacao Rio     10°47’S,  68°40’W    378           1.4               89.8     ±   1.9     0.44    32.4         568            L
Encontro          10°43’S,  68°50’W    211           1.4               91.4     ±   2.5     -1.029  25.1         295            L
Oculto            12°39’S,  68°56’W    925           0.7               109.9    ±   1.5     -0.974  10.9         613            L
Limon             12°32’S,  68°52’W    1350          0.1               108.2    ±   2.9     -0.587  10.6         160            L
El Tigre          11°30’S;  67°15’W    12            1.7               102.1    ±   3       0.162   25           20             M
El Sena           10°59’S,  65°43’W    7             3.3               111.3    ±   3.7     0.401   21.7         23             M
Alter do Chao     2°33’S,   54°53’W    3             23                44.9     ±   5.3     2.509   75.6         78             U
Rio Ouro Preto    10°45’S,  65°30’W    12            2                 127.7    ±   7.3     -1.917  4.5          22             M
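The cleaning step above depends on counting fields from the right-hand end of each extracted line, because site names and localities contain a variable number of words. The following is a minimal sketch of that idea on a single made-up line (not real data from Peres et al.; the ASCII stand-ins "d" and "+/-" replace the degree and plus-minus signs, and the column names are simplified for illustration):

```r
# Made-up line in the same layout as the extracted table rows (not real data)
raw_line <- "1 Some Site State, Country 1d00'S, 50d00'W 10 2.5 80.0 +/- 3.0 1.5 40.0 25"

# Split on single spaces, as in the cleaning step
tokens <- unlist(strsplit(raw_line, " "))

# Count from the right: the last 10 tokens are the coordinate and measurement columns
fields <- tail(tokens, 10)

# One-row data.frame with illustrative column names
toy_df <- data.frame(t(fields), stringsAsFactors = FALSE)
names(toy_df) <- c("Latitude", "Longitude", "AreaHectares", "Density", "MeanDBH",
                   "pm", "SE_DBH", "CumS", "PercentJuve", "NoTreeSampled")

# Convert the measurement columns to numeric; lapply() works column by column
num_cols <- c(3:5, 7:10)
toy_df[num_cols] <- lapply(toy_df[num_cols], as.numeric)
str(toy_df)
```

This approach only holds when every data row ends with the same number of space-separated fields; a row with a missing trailing field would shift the columns, so the result is worth checking by eye against the PDF.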

In this case, what we likely care about is:

To be honest, though, this data set is related to, but IMO a bit orthogonal to, the type of data that we would need.