Jones (1990) published an ethnographic study of cereal crop processing stages in Greece to provide stronger methods of inferring the stage of crop processing represented by an archaeological sample. These categories are winnowing, coarse sieve product, fine sieve product, and cleaned grain product. Although Jones’ other publications on this topic (1984, 1987) utilize regression and statistical models, this article provides interpretation guidelines based on the percentages of grains, rachis internodes, and weeds present in the sample. Other parts of the plant such as chaff were excluded from the model due to its variable preservation in archaeological contexts.
The data loading process is masked here, but is identical to the chunk visibile here
library(sqldf)
library(ggplot2)
I decided to do a two part SQL query. The first query selects data from the archaeobotanical finds table based on the type of plant (weed and cereal in this case). Information about plants was stored separately in a “planttype” table to avoid the redundancy of typing “cereal” for every entry of wheat and barley. I was excluded cereal fragment counts from this first table because Jones’ model counts a single grain based on the presence of an embryo.
The second query draws from the first table and creates a sum of all cereals and weeds identified. The samples analyzed using Jones’ model need to have a high number of cereals and weeds present in order for the conclusions to be valid. The original publication listed counts of at least 300 total cereals and weeds, but I would have had very little material to work with if I had followed this guideline. I noticed the totals for my samples dropped dramatically from about 75 to 30 and decided to use this as my cut for excluding the samples that were far too small for meaningful analysis.
cereals_weeds <- sqldf("SELECT finds.*, plant_type FROM finds LEFT JOIN planttype ON (finds.family=planttype.family AND finds.genus=planttype.genus AND finds.species=planttype.species) WHERE plant_part != 'frag' AND (planttype.plant_type = 'weed' OR planttype.plant_type= 'cereal')")
sample_cereal_sum <- sqldf("SELECT sampling.sample_id, SUM(cereals_weeds.count) AS finds_count FROM sampling LEFT JOIN cereals_weeds ON (sampling.sample_id=cereals_weeds.sample_id) GROUP BY sampling.sample_id ORDER BY SUM(cereals_weeds.count) DESC")
head(sample_cereal_sum, n=15)
## sample_id finds_count
## 1 S044 2000
## 2 S047 1578
## 3 S067 1226
## 4 S068 887
## 5 S057 310
## 6 S063 173
## 7 S058 123
## 8 S095 95
## 9 S056 90
## 10 S099 79
## 11 S064 49
## 12 S060 44
## 13 S065 40
## 14 S098 35
## 15 S045 33
Jones’ model combines all cereal species into a single summed percentage for grains and rachis internodes and a single percentage for crop weeds. The next query provides a sum of all plant parts present for each sample. I will then select the grain, rachis, and weed seed counts for further anlaysis.
sample_cereal_sum <- subset(sample_cereal_sum, finds_count > 75)
sample_plantpart <- sqldf("SELECT sample_cereal_sum.sample_id, plant_part, SUM(count) AS plantpartsum FROM sample_cereal_sum LEFT JOIN cereals_weeds ON (sample_cereal_sum.sample_id=cereals_weeds.sample_id) GROUP BY plant_part, sample_cereal_sum.sample_id ORDER BY sample_cereal_sum.sample_id")
head(sample_plantpart, n=15)
## sample_id plant_part plantpartsum
## 1 S044 grain 1993
## 2 S044 seed 7
## 3 S047 grain 18
## 4 S047 seed 1560
## 5 S056 grain 60
## 6 S056 rachis 2
## 7 S056 seed 28
## 8 S057 grain 164
## 9 S057 rachis 3
## 10 S057 seed 143
## 11 S058 grain 99
## 12 S058 seed 24
## 13 S063 grain 138
## 14 S063 seed 35
## 15 S067 chaff 10
#I later derive the percentages of each component of the sample by dividing the sum for the plant part by the totals from the sample_cereal_sum table, so this merge brings the totals from before into my new table
cereal_totals <- merge(sample_plantpart, sample_cereal_sum, by="sample_id")
head(cereal_totals, n=15)
## sample_id plant_part plantpartsum finds_count
## 1 S044 grain 1993 2000
## 2 S044 seed 7 2000
## 3 S047 grain 18 1578
## 4 S047 seed 1560 1578
## 5 S056 grain 60 90
## 6 S056 rachis 2 90
## 7 S056 seed 28 90
## 8 S057 grain 164 310
## 9 S057 rachis 3 310
## 10 S057 seed 143 310
## 11 S058 grain 99 123
## 12 S058 seed 24 123
## 13 S063 grain 138 173
## 14 S063 seed 35 173
## 15 S067 chaff 10 1226
#select only the rachis, grain, and (weed) seeds
cereal_totals <- cereal_totals[cereal_totals$plant_part == 'rachis'| cereal_totals$plant_part=='grain' | cereal_totals$plant_part=='seed',]
#calculate the percent presence of the rachis, grain and weed components of the sample
cereal_totals$percent <- cereal_totals$plantpartsum/cereal_totals$finds_count
The table resulting from the above steps lists grain, rachis, and seed counts rows rather than columns in my data table. It will be much easier to interpret the samples based on grain, rachis, and weed content if each sample had only one row and the grain, rachis, and weed percentages were listed as columns within the sample’s row. I decided the easiest work around would be to create a subset based on the “plant_part” value and change the generic “percent” column to contain “grain”, “rachis”, or “seed.” When I merge all of these tables back together again, the three values will appear as separate columns for each sample. This also helps me get around the problem that many samples did not have all three components present (so creating a function that worked based on a multiple of 3 to select the different components would not work)
grain_sub <- subset(cereal_totals, plant_part=='grain')
rachis_sub <- subset(cereal_totals, plant_part=='rachis')
weed_sub <- subset(cereal_totals, plant_part=='seed')
#change the name of the percentage column
names(grain_sub)[5] <- "grain_per"
names(rachis_sub)[5] <- "rachis_per"
names(weed_sub)[5] <- "weed_per"
#remove unnecessary columns from the data frames
grain_sub <- grain_sub[,c(1,5)]
rachis_sub <- rachis_sub[,c(1,5)]
weed_sub <- weed_sub[,c(1,5)]
#merge these back into a single table based on sample id
samples_per <- merge(grain_sub, rachis_sub, by='sample_id', all=TRUE)
samples_per <- merge(samples_per, weed_sub, by='sample_id', all=TRUE)
#set empty values to 0 because the steps below do not accept NA values
samples_per[is.na(samples_per)] <- 0
I selected samples based on the characteristics outlined in Jones 1990 for each sample. The conditional statements fill the “interpretation” value for a given if the columns meet the parameters.
samples_per$interpretation <- NA
"winnow" -> samples_per[samples_per$rachis_per > 0.5 & samples_per$weed_per > samples_per$grain_per, "interpretation"]
"coarse_sieve" -> samples_per[samples_per$rachis_per > 0.3, "interpretation"]
"fine_sieve" -> samples_per[samples_per$weed_per > 0.3, "interpretation"]
"clean_prod" -> samples_per[samples_per$grain_per > 0.75, "interpretation"]
head(samples_per)
## sample_id grain_per rachis_per weed_per interpretation
## 1 S044 0.99650000 0.000000000 0.0035000 clean_prod
## 2 S047 0.01140684 0.000000000 0.9885932 fine_sieve
## 3 S056 0.66666667 0.022222222 0.3111111 fine_sieve
## 4 S057 0.52903226 0.009677419 0.4612903 fine_sieve
## 5 S058 0.80487805 0.000000000 0.1951220 clean_prod
## 6 S063 0.79768786 0.000000000 0.2023121 clean_prod
The next sql query is long but is simply pulling the site key, latitude, and longitude from the other data tables. The sample_id column links to the latitude and longitude values via the occupation_data table, so the select statement must incorporate this table in the query. I am also interested in comparing crop processing stages to phase of occupation to look for evidence of change over time in the concentration of crop processing activities, so I included the phase column in the plot.
interpretation_geo_phase <- sqldf("SELECT report_info.her_no, report_info.her_cit, lat, lon, sampling.sample_id, interpretation, phase FROM report_info JOIN occupation_data ON (report_info.her_no = occupation_data.her_no AND report_info.her_cit=occupation_data.her_cit) JOIN sampling ON (occupation_data.her_no= sampling.her_no AND occupation_data.her_cit= sampling.her_cit AND occupation_data.context=sampling.context) JOIN samples_per ON (sampling.sample_id = samples_per.sample_id)")
ggplot(interpretation_geo_phase, aes(x=phase, fill=interpretation))+geom_bar(stat="count", position="dodge") +scale_fill_discrete(name="Legend", labels=c("Clean Products", "Fine Sieve"))
The final steps of this code separate the sample interpretations into new tables and provide counts of the number of samples at the site for the interpretation. The samples that I analyzed only had fine_sieve and clean products present based on the parameters above, but this approach could be expanded to include the other crop processing stages if present. I will then map the different processing stages as layers with the marker size corresponding to the number of samples from that stage at a given site. By mapping them as separate layers, I will be able to visualize sites for which multiple stages of crop processing are present.
In the future when I have an expanded data set with more samples available for analysis, I plan to separate the data into separate tables by phase and repeat the analysis steps below for each phase. Because only ten samples from three sites were included in this analysis, separate analysis for each phase did not seem productive.
interpretation_geo <- subset(interpretation_geo_phase, select=c("her_no", "her_cit", "lat", "lon", "sample_id", "interpretation"))
fine_sieve_geo <- subset(interpretation_geo, interpretation=='fine_sieve')
clean_prod_geo <- subset(interpretation_geo, interpretation=='clean_prod')
#counting the samples by site requires a single column rather than multi-column identifier for each site
fine_sieve_geo$id <- do.call(paste, c(fine_sieve_geo[c("her_no", "her_cit")], sep = "."))
count <- rle(sort(fine_sieve_geo$id))
#count the number of fine sieve products for each site
fine_sieve_geo$count <- count[[1]][ match( fine_sieve_geo$id , count[[2]] ) ]
#remove duplicate columns now that site count has been established
fine_sieve_geo <- fine_sieve_geo[!duplicated(fine_sieve_geo$id),]
#repeat process for clean products
clean_prod_geo$id <- do.call(paste, c(clean_prod_geo[c("her_no", "her_cit")], sep = "."))
count <- rle(sort(clean_prod_geo$id))
clean_prod_geo$count <- count[[1]][ match( clean_prod_geo$id , count[[2]] ) ]
clean_prod_geo <- clean_prod_geo[!duplicated(clean_prod_geo$id),]
#export results for mapping
#write.csv(fine_sieve_geo, "~/Grad Year 3/Advanced Data Structures/output/fine_sieve_geo.csv")
#write.csv(clean_prod_geo, "~/Grad Year 3/Advanced Data Structures/output/clean_prod_geo.csv")