Stage of crop processing by sample

Jones (1990) published an ethnographic study of cereal crop processing stages in Greece to provide stronger methods of inferring the stage of crop processing represented by an archaeological sample. These categories are winnowing, coarse sieve product, fine sieve product, and cleaned grain product. Although Jones’ other publications on this topic (1984, 1987) utilize regression and statistical models, this article provides interpretation guidelines based on the percentages of grains, rachis internodes, and weeds present in the sample. Other parts of the plant such as chaff were excluded from the model due to its variable preservation in archaeological contexts.

The data loading process is masked here, but is identical to the chunk visibile here

library(sqldf)
library(ggplot2)

Constructing the query

I decided to do a two part SQL query. The first query selects data from the archaeobotanical finds table based on the type of plant (weed and cereal in this case). Information about plants was stored separately in a “planttype” table to avoid the redundancy of typing “cereal” for every entry of wheat and barley. I was excluded cereal fragment counts from this first table because Jones’ model counts a single grain based on the presence of an embryo.

The second query draws from the first table and creates a sum of all cereals and weeds identified. The samples analyzed using Jones’ model need to have a high number of cereals and weeds present in order for the conclusions to be valid. The original publication listed counts of at least 300 total cereals and weeds, but I would have had very little material to work with if I had followed this guideline. I noticed the totals for my samples dropped dramatically from about 75 to 30 and decided to use this as my cut for excluding the samples that were far too small for meaningful analysis.

cereals_weeds <- sqldf("SELECT finds.*, plant_type FROM finds LEFT JOIN planttype ON (finds.family=planttype.family AND finds.genus=planttype.genus AND finds.species=planttype.species) WHERE plant_part != 'frag' AND (planttype.plant_type = 'weed' OR planttype.plant_type= 'cereal')")
sample_cereal_sum <- sqldf("SELECT sampling.sample_id, SUM(cereals_weeds.count) AS finds_count FROM sampling LEFT JOIN cereals_weeds ON (sampling.sample_id=cereals_weeds.sample_id) GROUP BY sampling.sample_id ORDER BY SUM(cereals_weeds.count) DESC")

head(sample_cereal_sum, n=15)
##    sample_id finds_count
## 1       S044        2000
## 2       S047        1578
## 3       S067        1226
## 4       S068         887
## 5       S057         310
## 6       S063         173
## 7       S058         123
## 8       S095          95
## 9       S056          90
## 10      S099          79
## 11      S064          49
## 12      S060          44
## 13      S065          40
## 14      S098          35
## 15      S045          33

Jones’ model combines all cereal species into a single summed percentage for grains and rachis internodes and a single percentage for crop weeds. The next query provides a sum of all plant parts present for each sample. I will then select the grain, rachis, and weed seed counts for further anlaysis.

sample_cereal_sum <- subset(sample_cereal_sum, finds_count > 75)

sample_plantpart <- sqldf("SELECT sample_cereal_sum.sample_id, plant_part, SUM(count) AS plantpartsum FROM sample_cereal_sum LEFT JOIN cereals_weeds ON (sample_cereal_sum.sample_id=cereals_weeds.sample_id) GROUP BY plant_part, sample_cereal_sum.sample_id ORDER BY sample_cereal_sum.sample_id")

head(sample_plantpart, n=15)
##    sample_id plant_part plantpartsum
## 1       S044      grain         1993
## 2       S044       seed            7
## 3       S047      grain           18
## 4       S047       seed         1560
## 5       S056      grain           60
## 6       S056     rachis            2
## 7       S056       seed           28
## 8       S057      grain          164
## 9       S057     rachis            3
## 10      S057       seed          143
## 11      S058      grain           99
## 12      S058       seed           24
## 13      S063      grain          138
## 14      S063       seed           35
## 15      S067      chaff           10
#I later derive the percentages of each component of the sample by dividing the sum for the plant part by the totals from the sample_cereal_sum table, so this merge brings the totals from before into my new table
cereal_totals <- merge(sample_plantpart, sample_cereal_sum, by="sample_id")

head(cereal_totals, n=15)
##    sample_id plant_part plantpartsum finds_count
## 1       S044      grain         1993        2000
## 2       S044       seed            7        2000
## 3       S047      grain           18        1578
## 4       S047       seed         1560        1578
## 5       S056      grain           60          90
## 6       S056     rachis            2          90
## 7       S056       seed           28          90
## 8       S057      grain          164         310
## 9       S057     rachis            3         310
## 10      S057       seed          143         310
## 11      S058      grain           99         123
## 12      S058       seed           24         123
## 13      S063      grain          138         173
## 14      S063       seed           35         173
## 15      S067      chaff           10        1226
#select only the rachis, grain, and (weed) seeds
cereal_totals <- cereal_totals[cereal_totals$plant_part == 'rachis'| cereal_totals$plant_part=='grain' | cereal_totals$plant_part=='seed',]

#calculate the percent presence of the rachis, grain and weed components of the sample
cereal_totals$percent <- cereal_totals$plantpartsum/cereal_totals$finds_count

Data wrangling

The table resulting from the above steps lists grain, rachis, and seed counts rows rather than columns in my data table. It will be much easier to interpret the samples based on grain, rachis, and weed content if each sample had only one row and the grain, rachis, and weed percentages were listed as columns within the sample’s row. I decided the easiest work around would be to create a subset based on the “plant_part” value and change the generic “percent” column to contain “grain”, “rachis”, or “seed.” When I merge all of these tables back together again, the three values will appear as separate columns for each sample. This also helps me get around the problem that many samples did not have all three components present (so creating a function that worked based on a multiple of 3 to select the different components would not work)

grain_sub <- subset(cereal_totals, plant_part=='grain')
rachis_sub <- subset(cereal_totals, plant_part=='rachis')
weed_sub <- subset(cereal_totals, plant_part=='seed')

#change the name of the percentage column
names(grain_sub)[5] <- "grain_per"
names(rachis_sub)[5] <- "rachis_per"
names(weed_sub)[5] <- "weed_per"

#remove unnecessary columns from the data frames
grain_sub <- grain_sub[,c(1,5)]
rachis_sub <- rachis_sub[,c(1,5)]
weed_sub <- weed_sub[,c(1,5)]

#merge these back into a single table based on sample id
samples_per <- merge(grain_sub, rachis_sub, by='sample_id', all=TRUE)
samples_per <- merge(samples_per, weed_sub, by='sample_id', all=TRUE)

#set empty values to 0 because the steps below do not accept NA values
samples_per[is.na(samples_per)] <- 0

Interpretation

I selected samples based on the characteristics outlined in Jones 1990 for each sample. The conditional statements fill the “interpretation” value for a given if the columns meet the parameters.

samples_per$interpretation <- NA

"winnow" -> samples_per[samples_per$rachis_per > 0.5 & samples_per$weed_per > samples_per$grain_per, "interpretation"]
"coarse_sieve" -> samples_per[samples_per$rachis_per > 0.3, "interpretation"]
"fine_sieve" -> samples_per[samples_per$weed_per > 0.3, "interpretation"]
"clean_prod" -> samples_per[samples_per$grain_per > 0.75, "interpretation"]

head(samples_per)
##   sample_id  grain_per  rachis_per  weed_per interpretation
## 1      S044 0.99650000 0.000000000 0.0035000     clean_prod
## 2      S047 0.01140684 0.000000000 0.9885932     fine_sieve
## 3      S056 0.66666667 0.022222222 0.3111111     fine_sieve
## 4      S057 0.52903226 0.009677419 0.4612903     fine_sieve
## 5      S058 0.80487805 0.000000000 0.1951220     clean_prod
## 6      S063 0.79768786 0.000000000 0.2023121     clean_prod

Joining the sample interpretations to site data

The next sql query is long but is simply pulling the site key, latitude, and longitude from the other data tables. The sample_id column links to the latitude and longitude values via the occupation_data table, so the select statement must incorporate this table in the query. I am also interested in comparing crop processing stages to phase of occupation to look for evidence of change over time in the concentration of crop processing activities, so I included the phase column in the plot.

interpretation_geo_phase <- sqldf("SELECT report_info.her_no, report_info.her_cit, lat, lon, sampling.sample_id, interpretation, phase FROM report_info JOIN occupation_data ON (report_info.her_no = occupation_data.her_no AND report_info.her_cit=occupation_data.her_cit) JOIN sampling ON (occupation_data.her_no= sampling.her_no AND occupation_data.her_cit= sampling.her_cit AND occupation_data.context=sampling.context) JOIN samples_per ON (sampling.sample_id = samples_per.sample_id)")

ggplot(interpretation_geo_phase, aes(x=phase, fill=interpretation))+geom_bar(stat="count", position="dodge") +scale_fill_discrete(name="Legend", labels=c("Clean Products", "Fine Sieve"))

The final steps of this code separate the sample interpretations into new tables and provide counts of the number of samples at the site for the interpretation. The samples that I analyzed only had fine_sieve and clean products present based on the parameters above, but this approach could be expanded to include the other crop processing stages if present. I will then map the different processing stages as layers with the marker size corresponding to the number of samples from that stage at a given site. By mapping them as separate layers, I will be able to visualize sites for which multiple stages of crop processing are present.

In the future when I have an expanded data set with more samples available for analysis, I plan to separate the data into separate tables by phase and repeat the analysis steps below for each phase. Because only ten samples from three sites were included in this analysis, separate analysis for each phase did not seem productive.

interpretation_geo <- subset(interpretation_geo_phase, select=c("her_no", "her_cit", "lat", "lon", "sample_id", "interpretation"))

fine_sieve_geo <- subset(interpretation_geo, interpretation=='fine_sieve')
clean_prod_geo <- subset(interpretation_geo, interpretation=='clean_prod')

#counting the samples by site requires a single column rather than multi-column identifier for each site
fine_sieve_geo$id <- do.call(paste, c(fine_sieve_geo[c("her_no", "her_cit")], sep = "."))
count <- rle(sort(fine_sieve_geo$id))
#count the number of fine sieve products for each site
fine_sieve_geo$count <- count[[1]][ match( fine_sieve_geo$id , count[[2]] ) ]
#remove duplicate columns now that site count has been established
fine_sieve_geo <- fine_sieve_geo[!duplicated(fine_sieve_geo$id),]

#repeat process for clean products
clean_prod_geo$id <- do.call(paste, c(clean_prod_geo[c("her_no", "her_cit")], sep = "."))
count <- rle(sort(clean_prod_geo$id))
clean_prod_geo$count <- count[[1]][ match( clean_prod_geo$id , count[[2]] ) ]
clean_prod_geo <- clean_prod_geo[!duplicated(clean_prod_geo$id),]

#export results for mapping
#write.csv(fine_sieve_geo, "~/Grad Year 3/Advanced Data Structures/output/fine_sieve_geo.csv")
#write.csv(clean_prod_geo, "~/Grad Year 3/Advanced Data Structures/output/clean_prod_geo.csv")