Constructing the query

While I was in the process of data entry, I noticed that many of my samples contained unquantified results. Instead, the analyst listed species present in each sample and gave ranges that are difficult to compare with results from other reports. This query creates visualizations to demonstrate how widespread this problem was in my research. It also demonstrates the workflow of writing the query in SQL, executing it in R and data wrangling in R. The primary benefit of this workflow is that the script can repeat the exact data analysis process if I update my datatables.

library(sqldf)
library(ggplot2)

This section loads the CSV files. When I originally saved them from Excel, something odd happened to the formatting of the first column of each CSV, so the script also corrects this problem. In order to construct multiple plots on the same image, I used the multiplot function as defined here. It does not come included in the ggplot2 package and must defined for each session that it is used.

report_info <- read.csv("~/Grad Year 3/Advanced Data Structures/LincolnshireArchaeobotany/reportinfo.csv")
occupation_data <- read.csv("~/Grad Year 3/Advanced Data Structures/LincolnshireArchaeobotany/occupationdata.csv")
excav_info <- read.csv("~/Grad Year 3/Advanced Data Structures/LincolnshireArchaeobotany/excavationinfo.csv")
biblio <- read.csv("~/Grad Year 3/Advanced Data Structures/LincolnshireArchaeobotany/bibliographic.csv")
finds <- read.csv("~/Grad Year 3/Advanced Data Structures/LincolnshireArchaeobotany/archaeobotfinds.csv")
planttype <- read.csv("~/Grad Year 3/Advanced Data Structures/LincolnshireArchaeobotany/planttype.csv")
sampling <- read.csv("~/Grad Year 3/Advanced Data Structures/LincolnshireArchaeobotany/samplinglog.csv")
sites_sp <- read.csv("~/Grad Year 3/Advanced Data Structures/LincolnshireArchaeobotany/sitesspatial.csv")

names(biblio)[1] <- "her_no"
names(excav_info)[1] <- "her_no"
names(finds)[1] <- "bot_finds_id"
names(occupation_data)[1] <- "her_no"
names(planttype)[1] <- "family"
names(report_info)[1] <- "report_type"
names(sampling)[1] <- "sample_id"
sites_sp <- sites_sp[,2:8]

Querying the Data

Because my data joins across several tables, I found it easiest to pull information from across several tables in the SQL query language (rather than writing a series of joins and subsets in R). The sqldf package allowed me to write the query in SQL and execute in in R. I saved the results as a data frame and found it easy to manipulate and plot.

#select all samples and their site code information if they have unquantified finds
unquant <- sqldf("SELECT DISTINCT sampling.her_no, sampling.her_cit, sampling.sample_id FROM sampling JOIN finds ON (sampling.sample_id=finds.sample_id)  WHERE notes='unquant'")

#create a new column and assign a value to it.  This will mark unquantified samples from quantified samples in future steps
unquant$unquant <- "unquant"

#merge with the original table of samples and keep all records.  quantified samples will show up as NA in the "unquant" column.
#set the unquant column value to "quant" for all blank (quantified) samples
unquant_sites <- merge(sampling, unquant, by = c("her_no", "her_cit", "sample_id"), all.x=TRUE)
unquant_sites$unquant[is.na(unquant_sites$unquant)] <- "quantified"

#create a table of unquantified samples by site code. remove duplicate sites
unquant_sites[!duplicated(unquant_sites[1:2]),] -> unquant_sites_unique

With only a few steps, I’ve created a table of sample data that distinguishes unquantified samples from quantified samples and sites with unquantified samples from sites with unquantified samples. I can now plot these results to see how the counts of unquantified samples compare to quantified samples.

ggplot(unquant_sites, aes(x=unquant, fill=unquant)) + geom_bar(stat="count") + labs(x="Unquantified Samples") + scale_fill_discrete(name="Legend", labels=c("quantified", "unquantified"))-> p1
ggplot(unquant_sites_unique, aes(x=unquant, fill=unquant)) + geom_bar(stat="count") + labs(x="Unquantified Samples by Site") + scale_fill_discrete(name="Legend", labels=c("quantified", "unquantified"))-> p2

multiplot(p1, p2)

#write.csv(unquant, file= "~/Grad Year 3/Advanced Data Structures/output/unquant.csv")

Samples and Sites by Phase

In previous research, I noticed a bias of samples towards the Roman period, even on sites occupied over multiple phases. I constructed a simple query to present a direct comparison of the counts of samples by phase to the counts of sites by phase. In order to facilitate lining up the colors and positions of the bars on the histogram, I eliminated the three time periods for which there were no samples taken (Bronze Age, Neolithic, and Mid/Late Saxon). The plots below clearly demonstrate a disporportionate number of samples collected from the Roman period despite the comparatively few sites with a Roman component.

samples_phase <- sqldf("SELECT sampling.her_no, sampling.her_cit, sampling.sample_id, phase FROM sampling JOIN occupation_data ON (sampling.her_no=occupation_data.her_no AND sampling.her_cit=occupation_data.her_cit AND sampling.context=occupation_data.context)")
p3 <- ggplot(samples_phase, aes(x=phase, fill=phase))+geom_bar(stat="count") + labs(x="Count of Samples by Phase")

sites_phase_all <- subset(occupation_data, select=c("her_no", "her_cit", "phase"))
sites_phase_all <- unique(sites_phase_all)
sites_phase_all <- sites_phase_all[sites_phase_all$phase != 'bronze' & sites_phase_all$phase != 'neolithic' & sites_phase_all$phase !='mid/late saxon',]
p4 <- ggplot(sites_phase_all, aes(x=phase, fill=phase))+geom_bar(stat="count") + labs(x="Count of Sites by Phase")

multiplot(p3, p4)