A Model for more complex queries

This markdown largely models the direction I would have liked to go with my data if the quality of the available archaeobotanical data was better. I consulted ethographic studies of cereal processing in Greece. The author (Jones 1987, Jones 1990) used the ethnographic studies as a basis of comparison to archaeological samples in order to infer the stage of processing at which the archaeological sample was deposited. Although I could not access the original study in time for this week’s update (Jones 1984), I was able to infer enough from the publications to come up with reasonable parameters for distinguishing fully processed cereals from processing debris.

library(sqldf)
library(leaflet)
library(maptools)
library(raster)

#I decided loaded all 8 data tables even though I only drew from 3 for the purposes of this work. 
report_info <- read.csv("~/Grad Year 3/Advanced Data Structures/LincolnshireArchaeobotany/reportinfo.csv")
occupation_data <- read.csv("~/Grad Year 3/Advanced Data Structures/LincolnshireArchaeobotany/occupationdata.csv")
excav_info <- read.csv("~/Grad Year 3/Advanced Data Structures/LincolnshireArchaeobotany/excavationinfo.csv")
biblio <- read.csv("~/Grad Year 3/Advanced Data Structures/LincolnshireArchaeobotany/bibliographic.csv")
finds <- read.csv("~/Grad Year 3/Advanced Data Structures/LincolnshireArchaeobotany/archaeobotfinds.csv")
planttype <- read.csv("~/Grad Year 3/Advanced Data Structures/LincolnshireArchaeobotany/planttype.csv")
sampling <- read.csv("~/Grad Year 3/Advanced Data Structures/LincolnshireArchaeobotany/samplinglog.csv")
sites_sp <- read.csv("~/Grad Year 3/Advanced Data Structures/LincolnshireArchaeobotany/sitesspatial.csv")

#there was a problem loading the first column name in the datatable, so this manually writes over the correct name of the first column for each table.
names(biblio)[1] <- "her_no"
names(excav_info)[1] <- "her_no"
names(finds)[1] <- "bot_finds_id"
names(occupation_data)[1] <- "her_no"
names(planttype)[1] <- "family"
names(report_info)[1] <- "report_type"
names(sampling)[1] <- "sample_id"
names(sites_sp)[1] <- "her_no"

SQL queries

I used sqldf to query my datatables because it was much easier to pull information from across multiple joins. The first parameter of applying ethnographic data to archaeobotanical samples requires having a sufficient number of cereals parts (grain, rachis, etc.) present within the sample for analysis. I constructed a query to first select based on the sum of all cereal parts for the four major cereal crops (wheat, barley, oats, and rye) per sample.

sample_cereal_sum <- sqldf("SELECT sampling.sample_id, SUM(finds.count) AS finds_count FROM sampling LEFT JOIN finds ON (sampling.sample_id=finds.sample_id) WHERE genus IN ('Hordeum', 'Triticum', 'Avena', 'Secale') GROUP BY sampling.sample_id ORDER BY SUM(finds.count) DESC")

head(sample_cereal_sum, n=15)
##    sample_id finds_count
## 1       S044        2019
## 2       S067         984
## 3       S068         873
## 4       S057         173
## 5       S063         149
## 6       S056          87
## 7       S058          81
## 8       S095          44
## 9       S065          36
## 10      S045          35
## 11      S064          35
## 12      S047          29
## 13      S060          24
## 14      S099          24
## 15      S059          21

Based on the original publications, I should have about 100 cereal parts in order to proceed with comparisons to ethnographic samples. I decided to set my cut at 80 cereal parts so that I could have a slightly larger data set. The next lowest sum of all cereal parts for a given sample was in the 30s, so 80 seemed to be a reasonable cut given the data that I was working with.

The ethnographic model (Jones 1990) utilizes percentages of various parts of the cerealin order to infer the phase of processing. The query below groups sums for each sample by the plant part of the cereal rather than providing a sum for all cereals as above. I then calculated the percentage of each cereal plant part of the whole cereal assemblage (ex: % of grains vs. % of rachis fragments)

sample_cereal_sum <- subset(sample_cereal_sum, finds_count > 80)

sample_plantpart <- sqldf("SELECT sample_cereal_sum.sample_id, plant_part, SUM(count) AS plantpartsum FROM sample_cereal_sum LEFT JOIN finds ON (sample_cereal_sum.sample_id=finds.sample_id) WHERE genus IN ('Hordeum', 'Triticum', 'Avena', 'Secale') GROUP BY plant_part, sample_cereal_sum.sample_id ORDER BY sample_cereal_sum.sample_id")

head(sample_cereal_sum, n=15)
##   sample_id finds_count
## 1      S044        2019
## 2      S067         984
## 3      S068         873
## 4      S057         173
## 5      S063         149
## 6      S056          87
## 7      S058          81
#merge with the previous table so that I have the total of all cereal parts
cereal_totals <- merge(sample_plantpart, sample_cereal_sum, by="sample_id")

head(cereal_totals, n=15)
##    sample_id plant_part plantpartsum finds_count
## 1       S044      glume           NA        2019
## 2       S044      grain         2019        2019
## 3       S044   plumules           NA        2019
## 4       S044     rachis           NA        2019
## 5       S044       stem           NA        2019
## 6       S056      grain           85          87
## 7       S056     rachis            2          87
## 8       S057      grain          170         173
## 9       S057     rachis            3         173
## 10      S058      grain           81          81
## 11      S063      grain          149         149
## 12      S067      grain          972         984
## 13      S067     rachis           12         984
## 14      S068      grain          840         873
## 15      S068     rachis           33         873
#calculation of percentage within each sample by plant part
cereal_totals$percent <- cereal_totals$plantpartsum/cereal_totals$finds_count

Interpretation

I currently have a preliminary method of assessing the stage at which the material became carbonized. I selected all samples for which the percentage of grains (vs. all ther cereal parts) was greater than 80%. All of the samples met this condition. This shows the samples that have sufficient numbers for ethnographic comparison were likely finished products. My comparison to the ethonographic data would be stronger if I incorporated the weed seed data, so this is only a preliminary conclusion.

grain_products <- sqldf("SELECT * FROM cereal_totals WHERE plant_part='grain' AND percent > 0.80")


#a query to connect the grain products data to the spatial information in order to create a map.  This requires joining across several tables.
grain_products_geo <- sqldf("SELECT report_info.her_no, report_info.her_cit, lat, lon, sampling.sample_id, percent FROM report_info JOIN occupation_data ON (report_info.her_no = occupation_data.her_no AND report_info.her_cit=occupation_data.her_cit) JOIN sampling ON (occupation_data.her_no= sampling.her_no AND occupation_data.her_cit= sampling.her_cit AND occupation_data.context=sampling.context) JOIN grain_products ON (sampling.sample_id = grain_products.sample_id)")
#conversion to a spatial object
coordinates(grain_products_geo) = ~lon + lat
projection(grain_products_geo) <- CRS("+proj=lonlat +ellps=WGS84")
#inclusion of all reports to compare distribution of all sites to the probable grain products.
coordinates(report_info) = ~lon + lat
projection(report_info) <- CRS("+proj=lonlat +ellps=WGS84")

leaflet() %>% addCircles(data=grain_products_geo, color="blue", popup=report_info$her_no) %>% addCircles(data=report_info, color="gray")%>% addTiles()