Tabulizer and Digitizer examples

Rrrresources

FMI on the nitty gritty of R and general tips/tricks:

A humorous (and insightful) take comes from seasoned Python programmers who feel very Arrrgh about R

Let us know if you’d like more resources!

Issues with installation

Installing tabulizer can be difficult because of underlying problems with rJava that are frustratingly opaque and hard to diagnose. Here are my notes on how I got rJava, and subsequently tabulizer, to install properly.

The root cause appears to be a pointing and/or version issue with the Java distribution that rJava tries to use.
1. Java needs to be version 1.7 or 1.8 and up, but somewhere on the tabulizer or tabula end (tabula is the Java library that tabulizer imports into R), rJava ends up pointing to v1.6.x by default if that version is on the machine.
Open a terminal and type the following:
1. MacBook-Pro-5:~ Char$ export JAVA_HOME=`/usr/libexec/java_home -v 1.8`
2. MacBook-Pro-5:~ Char$ export LD_LIBRARY_PATH=$JAVA_HOME/jre/lib/server
3. MacBook-Pro-5:~ Char$ env
From the env command, you should now see:
1. PWD=/Users/Char
2. JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_151.jdk/Contents/Home
3. LANG=en_US.UTF-8
In the terminal, call open -a RStudio
In RStudio, call:

# install.packages("rJava")
library(rJava)
.jinit() # start the Java virtual machine
.jcall("java/lang/System", "S", "getProperty", "java.runtime.version") # report the Java version that rJava is actually using
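
The version requirement above can also be checked programmatically. Here is a minimal sketch in base R, assuming the version string looks like what the java.runtime.version property returns (for example "1.8.0_151-b12", or "11.0.2+9" on newer JDKs). Note that `java_version_ok` is a hypothetical helper written for illustration, not part of rJava or tabulizer.

```r
# Hypothetical helper: TRUE if a Java version string is Java 7 (1.7) or newer.
java_version_ok <- function(version_string, minimum = 7) {
  # Keep only the leading dotted number, e.g. "1.8.0" from "1.8.0_151-b12"
  core <- sub("^([0-9]+(\\.[0-9]+)*).*$", "\\1", version_string)
  parts <- as.integer(strsplit(core, ".", fixed = TRUE)[[1]])
  # Legacy scheme: "1.8" means Java 8; newer scheme: "11.0.2" means Java 11
  major <- if (parts[1] == 1 && length(parts) > 1) parts[2] else parts[1]
  major >= minimum
}

java_version_ok("1.6.0_65-b14")   # FALSE: the version rJava should not point to
java_version_ok("1.8.0_151-b12")  # TRUE

# Which JDK is R being pointed at? An empty string means JAVA_HOME is not set.
Sys.getenv("JAVA_HOME")
```

The string returned by the .jcall() line above can be passed straight to this helper.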

Examples of using `tabulizer` and `digitize`

Importing an example figure

Below, we will use the googledrive package to import the file Galetti1A.png from our team drive. We will then call digitize to extract the values from the bar chart.

###===========================================================================
### Importing and downloading figure from Google Drive Team Folder
###===========================================================================
library(googledrive) # provides drive_get() and drive_download()
galetti1A_loc <- drive_get(id="10CXIOQAIIoC8Z9K_xjx97PwB63XEU-YR", team_drive = "CoharvestSRE") # Location of Galetti1A figure
# Download the figure to your local folder. You can skip these two steps by downloading the file by hand or by working from a copy already on your machine.
galetti1A_fig <- drive_download(file=galetti1A_loc, path="~/GoogleDrive/Rscripts/SRE_coharvest/Galetti1A.png", overwrite=TRUE)

###===========================================================================
### Digitizing plot
###===========================================================================

galetti1A <- digitize::digitize(image_filename=galetti1A_fig$local_path) # Note that you could instead directly specify a local path and bypass the two steps above. Then it would be:
# galetti1A <- digitize::digitize(image_filename="~/GoogleDrive/Rscripts/SRE_coharvest/Galetti1A.png")

# save(galetti1A, file="~/GoogleDrive/Rscripts/SRE_coharvest/galetti1Adigitize.RData") # save the output to a serialized file

dev.copy(png,"~/Downloads/galetti1A-digitized.png", width=4, height=3, units="in", res=120, bg="transparent") # store the output to file; this differs from the usual routine because the digitize package needs unencumbered access to the "Plots" tab in RStudio (not sure how it works in base R)
dev.off()

We can inspect the output by displaying it here:

galetti1A
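
For intuition about where these numbers come from: as I understand it, digitizing tools of this kind convert clicked pixel positions to data values by linear interpolation between two calibration clicks of known value on each axis. The sketch below illustrates that idea in base R. `pixel_to_data` and the pixel positions are made up for illustration, and this is not the code used inside the digitize package.

```r
# Hypothetical helper: map a pixel coordinate to a data value, given two
# calibration clicks (pixel positions) and the axis values they stand for.
pixel_to_data <- function(pixel, cal_pixels, cal_values) {
  slope <- diff(cal_values) / diff(cal_pixels)
  cal_values[1] + slope * (pixel - cal_pixels[1])
}

# Made-up example: y = 0 sits at pixel row 400 and y = 100 at pixel row 50
# (pixel rows often increase downward, so the slope is negative).
pixel_to_data(pixel = 225, cal_pixels = c(400, 50), cal_values = c(0, 100))
```

A click halfway between the two calibration rows maps to the midpoint of the axis range, which is why careful calibration clicks matter more than anything else for accuracy.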

###===========================================================================
### Extracting data
###===========================================================================
attach("~/GoogleDrive/Rscripts/SRE_coharvest/galetti1Adigitize.RData")
galetti1A # we don't care about the X values; these are just dummy values.

##            x         y
## 1  1.0000000 54.855196
## 2  0.9905914 76.490630
## 3  1.9973118 36.286201
## 4  1.9784946 86.541738
## 5  3.0134409 26.746167
## 6  2.9946237 93.015332
## 7  4.0013441  2.385009
## 8  4.0201613 50.085179
## 9  5.0080645 15.502555
## 10 4.9986559 23.339012
## 11 5.9771505 69.846678
## 12 5.9865591 71.890971
## 13 6.9838710  5.110733
## 14 6.9650538  9.199319
## 15 7.9905914 10.221465
## 16 7.9811828 11.413969

### 1: Pulling the y-column out
galetti1A_data <- galetti1A$y # we only care about the y-axis as this is a barplot with categories on the X

### 2: Pushing these values into a data.frame
galetti1A_df <- data.frame(Intact=galetti1A_data[c(TRUE,FALSE)],
                           Rodents=galetti1A_data[c(FALSE,TRUE)]-galetti1A_data[c(TRUE,FALSE)],
                           Insects=100-galetti1A_data[c(FALSE,TRUE)])
rownames(galetti1A_df) <- c("JI","Pi","Ca","PA","UN","XJ","SH","Af")

### 3: Displaying the table in a nice format (note that you need to run install.packages("htmlTable"))
htmlTable::htmlTable( round(galetti1A_df,1), # round to 1 decimal point, also see ?signif
                      rowlabel="Sites")

Sites	Intact	Rodents	Insects
JI	54.9	21.6	23.5
Pi	36.3	50.3	13.5
Ca	26.7	66.3	7
PA	2.4	47.7	49.9
UN	15.5	7.8	76.7
XJ	69.8	2	28.1
SH	5.1	4.1	90.8
Af	10.2	1.2	88.6
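
The data.frame step works because the recycled logical indices c(TRUE,FALSE) and c(FALSE,TRUE) pick out the odd- and even-numbered clicks, which the code treats as the tops of the first two segments of each stacked bar. Here is a small sanity check, assuming the bars are percentages that sum to 100. The y values are copied from the output printed above, and the variable names are mine.

```r
# y values copied from the digitize output shown above
y <- c(54.855196, 76.490630, 36.286201, 86.541738, 26.746167, 93.015332,
        2.385009, 50.085179, 15.502555, 23.339012, 69.846678, 71.890971,
        5.110733,  9.199319, 10.221465, 11.413969)

lower <- y[c(TRUE, FALSE)]  # odd-numbered clicks (1st, 3rd, ...)
upper <- y[c(FALSE, TRUE)]  # even-numbered clicks (2nd, 4th, ...)

segments <- data.frame(Intact  = lower,
                       Rodents = upper - lower,
                       Insects = 100 - upper)

# Each site's three segments should add up to 100 percent
stopifnot(isTRUE(all.equal(rowSums(segments), rep(100, 8),
                           check.attributes = FALSE)))

# A negative segment would flag clicks made in the wrong order
any(segments < 0)
```

If any(segments < 0) ever comes back TRUE, the two clicks for some bar were probably made top-first, and that bar should be re-digitized.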

My guess is that what is ultimately useful from this type of data is:

What is dispersed (galetti1B; black cells)?
The relationship between hunting pressure and/or rodent density and the percentage of seeds that are predated (preyed upon by rodents in Galetti’s paper) versus dispersed (1B).

Tabulizer example: Peres et al., Brazil nut harvest

In Table S1, Carlos Peres and co-authors present information on the density and population structure of Brazil nut populations.

Steps:

1. Import the PDF from the Google Drive Team folder
2. Use tabulizer to import the table
3. Perform cleaning to get the table into a usable format

###===========================================================================
### Importing table from PDF
###===========================================================================

peresPDF <- drive_find(pattern="Peres-BrazilNut-Science", team_drive="CoharvestSRE")
peresPDFurl <- drive_link(peresPDF)
# peresS1 <- tabulizer::extract_tables(file=peresPDFurl, pages=10) # uffda, sadly does not quite work. Will have to spend some time troubleshooting
peresS1 <- tabulizer::extract_tables(file="/Volumes/GoogleDrive/Team Drives/CoharvestSRE/Readings/Parameters/Peres-BrazilNut-Science.full.pdf", pages=10)
 
save(peresS1, file="~/GoogleDrive/Rscripts/SRE_coharvest/peresS1.RData")

attach("~/GoogleDrive/Rscripts/SRE_coharvest/peresS1.RData")
###===========================================================================
### Cleaning table
###===========================================================================
head(peresS1[[1]]) # wow it looks AWFUL!

##      [,1]                                                                                                                      
## [1,] "Table S1.  Brazil nut tree (Bertholletia excelsa) populations examined in this study.  Site numbers refer to those show" 
## [2,] "corresponds to the aggregate area of forest plots and/or transect strip-width censused at each site.  NA = data not avai"
## [3,] ""                                                                                                                        
## [4,] ""                                                                                                                        
## [5,] "Site Latitude, Area sampled Density Mean ± SE s* a % No. of tree"                                                        
## [6,] "No. Site locality State, Country Longitude (ha) (ind. ha–1) DBH (cm) Juveniles sampled"

peresS1list <- peresS1[[1]][7:38,]
peresS1list <- lapply(peresS1list, function(x) {unlist(strsplit(x," "))})

peres_entries <- unlist(lapply(peresS1list, length)) # the table import was wonky so it is incorrectly splitting along newlines
peres_rows <- which(peres_entries > 3)

peresS1df <- NULL # run a for loop below to extract data from the tabulizer object
for (i in seq_along(peres_rows)) { # seq_along() is safe even if peres_rows is empty
  peresS1df <- rbind(peresS1df, tail(peresS1list[[peres_rows[i]]], 10))
}

peresS1df <- data.frame(peresS1df, stringsAsFactors = FALSE) # convert output from matrix to data.frame format
names(peresS1df) <- c("Latitude","Longitude","AreaHectares","Density-treeHA-1","MeanDBH","pm","SE_DBH","CumS*","PercentJuve","NoTreeSampled")
peresS1df[c(3:5,7:10)] <- apply(peresS1df[c(3:5,7:10)], 2, as.numeric) # convert several columns to numeric class

## Warning in apply(peresS1df[c(3:5, 7:10)], 2, as.numeric): NAs introduced by
## coercion

peresS1df$HuntIntensity <- c("U","U","P","L","L","P","P","L","M","L","M",rep("U",2),rep("L",6),rep("M",2),"U","M") # argh, this column didn't import so we get to hand impute...lovely
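
The cleaning above relies on one observation: the site name and locality contain a variable number of words, but the last ten space-separated tokens on each full row are always the same ten fields. Here is a minimal sketch on a single line. The numeric fields are copied from the Pinkaiti row, while the leading "1 SiteName State, Country" text is a placeholder, not the PDF's actual wording.

```r
raw_line <- "1 SiteName State, Country 7°46’S, 51°57’W 60 3.3 72.6 ± 3.8 1.886 43.3 224"

tokens <- strsplit(raw_line, " ", fixed = TRUE)[[1]]
fields <- tail(tokens, 10)   # drop the variable-length name and locality
fields

# Convert the numeric fields. Tokens that cannot be parsed as numbers become
# NA, which is the likely source of the coercion warning shown above.
numeric_fields <- suppressWarnings(as.numeric(fields[c(3:5, 7:10)]))
numeric_fields

# The same idea applied to several rows at once, without a for loop
rows <- list(tokens, tokens)
do.call(rbind, lapply(rows, tail, 10))
```

The do.call(rbind, ...) form builds the same character matrix as the loop, but without growing the object one row at a time.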

###===========================================================================
### Displaying table
###===========================================================================
htmlTable::htmlTable(peresS1df, rnames=c("Pinkaiti","Kranure","Saraca","Maraba","Tapajos","Alto Cajar","Iratapuru","Aventura","Ussicanta","Lago Ciputuba","Amana","Rio Cristalino","Claudia","Nova Esperanca","Colocacao Tucuma","Colocacao Rio","Encontro","Oculto","Limon","El Tigre","El Sena","Alter do Chao","Rio Ouro Preto"), rowlabel="Sites")

Sites	Latitude	Longitude	AreaHectares	Density-treeHA-1	MeanDBH	pm	SE_DBH	CumS*	PercentJuve	NoTreeSampled	HuntIntensity
Pinkaiti	7°46’S,	51°57’W	60	3.3	72.6	±	3.8	1.886	43.3	224	U
Kranure	7°49’S,	51°55’W	16	3.4	65.7	±	7.6	1.92	52.5	40	U
Saraca	1°45’S,	56°30’W	769	1.5	134.8	±	1.4	-1.1	1.6	1165	P
Maraba	5°12’S,	49°06’W	9	4.3	119.4	±	11.7	1.572	33.3	39	L
Tapajos	3°55’S,	55°28’W	100	0.7	73.8	±	2.9	1.438	35.7	269	L
Alto Cajar	0°20’S,	51°43’W	23	12	156.4	±	2.9	-1.85	0.7	272	P
Iratapuru	0°03’N,	52°30’W	25	9.4	154.1	±	3	-1.779	0.9	230	P
Aventura	4°19’S,	62°30’W	25.8	6.8	102.3	±	4.3	0.315	24.6	122	L
Ussicanta	4°21’S,	62°35’W	15.4	8.5	133.5	±	3.04	-2.582	3.8	132	M
Lago Ciputuba	5°48’S,	60°13’W	114	1.8	116.6	±	3.91	0.348	22.4	201	L
Amana	2°21’S,	64°45’W	50	1.4	123.5	±	4.1	-1.891	5.3	76	M
Rio Cristalino	9°28’S,	55°55’W	64	4.9	90.7	±	6.1	1.691	40.7	113	U
Claudia	11°30’S,	54°48’W	35	3.6	71	±	4.5		31.2	125	U
Nova Esperanca	10°40’S,	68°30’W	51	3.1	73.9	±	4.3	1.809	47.2	161	L
Colocacao Tucuma	10°51’S,	68°44’W	184.5	1.4	108	±	2.4	-0.783	12.2	255	L
Colocacao Rio	10°47’S,	68°40’W	378	1.4	89.8	±	1.9	0.44	32.4	568	L
Encontro	10°43’S,	68°50’W	211	1.4	91.4	±	2.5	-1.029	25.1	295	L
Oculto	12°39’S,	68°56’W	925	0.7	109.9	±	1.5	-0.974	10.9	613	L
Limon	12°32’S,	68°52’W	1350	0.1	108.2	±	2.9	-0.587	10.6	160	L
El Tigre	11°30’S;	67°15’W	12	1.7	102.1	±	3	0.162	25	20	M
El Sena	10°59’S,	65°43’W	7	3.3	111.3	±	3.7	0.401	21.7	23	M
Alter do Chao	2°33’S,	54°53’W	3	23	44.9	±	5.3	2.509	75.6	78	U
Rio Ouro Preto	10°45’S,	65°30’W	12	2	127.7	±	7.3	-1.917	4.5	22	M

In this case, what we likely care about is:

Do they specify what “hunt intensity” means? Is it a measure of how many seeds were removed?
Is there information on agouti density along with the hunt intensity and juvenile composition data?

To be honest, however, I think this data set is related to, but somewhat orthogonal to, the type of data that we would need.

Tabulizer and Digitizer examples

KEA - SPA

6/5/2018
