Sampling words from Darwin’s Origin of Species

Load libraries

library(mosaic)
library(dplyr)
library(googledrive)
library(googlesheets4)
library(flextable) 
set_flextable_defaults(fonts_ignore=TRUE)

Read in new data from 2024

darwin<-read_sheet("https://docs.google.com/spreadsheets/d/1xNicDAVfd0QhNVuXhAgkBVrdUjveTTF12gBIZO9UsUY/edit?usp=sharing" )
darwin<-darwin[-1,]

Read in data from last year and select observations recorded in 2024

darwinmix<-read_sheet("https://docs.google.com/spreadsheets/d/1shgSkLAEsfhax5SINUAgxs6v8AEp90cn2oWW4CMXMXM/edit?usp=sharing")

## ✔ Reading from "DarwinOoS2023 (Responses)".

## ✔ Range 'Form Responses 1'.

Keep only those cases where the timestamp is from 2024

ind <- substr(darwinmix$Timestamp, 1, 4) =="2024"
darwinmix<-darwinmix[ind==TRUE,]
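The year filter above can be sketched on a toy data frame. This is a minimal base R illustration, assuming the Timestamp column is a date-time (POSIXct) and may contain missing values; the `responses` data frame and its values are made up for the example. Wrapping the logical index in `which()` drops rows with a missing timestamp instead of returning all-NA rows.

```r
# Toy responses (hypothetical values, not the class data)
responses <- data.frame(
  Timestamp = as.POSIXct(c("2023-09-07 10:15:00", "2024-09-05 09:30:00",
                           NA, "2024-09-05 09:42:00"), tz = "UTC"),
  mean.char = c(5.1, 6.3, 4.8, 5.7)
)

# Extract the four-digit year and compare it to the target year
is_2024 <- format(responses$Timestamp, "%Y") == "2024"

# which() keeps only rows where the comparison is TRUE (NA is dropped)
responses_2024 <- responses[which(is_2024), ]
responses_2024
```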

Now, merge with the responses submitted through this year's new form

darwin<-rbind(darwin, darwinmix)

Rename some of the variables to make them easier to work with

names(darwin)[3]<-"mean.char"  
names(darwin)[2]<-"Sampling.method"

Read in the passage from Darwin so we can calculate the true mean number of characters

oas<-read.csv("data/Darwin.csv")

True population mean - this is what we are trying to estimate!

(mean.nchar<-mean(~nchar, data=oas))

## [1] 4.936652

Distribution of sample means from students’ samples

gf_histogram(~mean.char, data=darwin, xlab="Mean Number of Characters per Word") %>% 
  gf_vline(xintercept=~4.93, col="red") %>% 
  gf_vline(xintercept=~mean(~mean.char, data=darwin, na.rm=TRUE), col = "blue")

## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_bin()`).

Mean (of the means) for the class

mean(~mean.char, data=darwin, na.rm = TRUE)

## [1] 6.246154

We see that, on average, the mean number of characters per word in the samples taken by students is higher than the population mean (4.93 characters). Bias = the difference between the blue and red lines.

(bias<-mean(~mean.char, data=darwin, na.rm = TRUE) - 4.93)

## [1] 1.316154
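One possible explanation for a positive bias is that longer words catch the eye more easily. The sketch below simulates that idea in base R. Everything in it is an assumption made for illustration: `word_lengths` is a made-up population of 221 word lengths (not the actual passage), and the rule "probability of being picked is proportional to word length" is just one hypothetical way people might choose words.

```r
set.seed(1)

# Hypothetical population of 221 word lengths (NOT the actual Darwin passage)
word_lengths <- rep(1:12, times = c(5, 35, 40, 35, 25, 20, 20, 15, 12, 8, 4, 2))
true_mean <- mean(word_lengths)

# Assumed selection rule: the chance of picking a word is proportional
# to its length, so long words are over-represented in each sample of 10
judgment_means <- replicate(5000, {
  picked <- sample(word_lengths, size = 10, prob = word_lengths)
  mean(picked)
})

# Bias = average of the sample means minus the true population mean
bias_judgment <- mean(judgment_means) - true_mean
bias_judgment
```

Under this selection rule the estimated bias comes out positive, which is the same direction as the bias seen in the class data.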

# Methods used by students
methods <- darwin %>% select(Sampling.method) %>% data.frame() 
flextable::flextable(methods, cwidth=60)%>%
  width(width=6)

Sampling.method
I chose 15 words at random and took the average of the number of letters in each word.
I picked a word in the middle of each line in the first 10 of the 12 rows.
I picked my 10 words from the big ideas from the passage ... trying to be as broad as possible
I picked several long (5-9 letters) words to account for vocabulary and I also picked several short words (2-4 letters) to account for connective words like "as," "to," etc.
I chose my 10 words based on the overarching concepts and theories Darwin is trying to get at. This excerpt seems to focus heavily on domesticated vs naturally grown plants and the variations that may ensue as a result of the different conditions in which they are raised. My chosen words reflect these ideas.
I choose 10 words that I think were related to nature and wildlife.
I chose words that I saw were repeated more than once throughout the passage like "plants, animals, variety" and I also chose words that were significant to the passage such as "climates, diversity, organic". I chose these all particularly because they looked to be about the same length in the number of letters they had
I entered the passage into a word counter to get how many total words there are (221). I then divided 221 by 10 (=22.1), and picked every 22nd in the passage. I added up the letters in each selected word, divided the total number of letters by 10, and got 4.1.
First, I chose words that i've heard over twice. However, after doing this I didn’t come out with 10 words. So I went back and chose conjugations that appeared a lot like “if."
I chose my 10 words by first scanning the passage for length. I then tried to comb the text for words of certain lengths, and when I saw lengths of words that were more common I included more of those. I also tried to be conscious of adding words that were shorter like "as" and "or," which are otherwise easily overlooked for longer words.
Chose 10 words relatively evenly spaced throughout the paragraph, attempting to ignore word length but probably failing
I chose by closing my eyes and moving my mouse to a random location and choosing the word that my mouse hovered
Woke up late, jetlag
Picked one random word from the first ten lines
I generally looked for some words that were a range of lengths.
I tried to sample as randomly as possible by closing my eyes and moving my mouse to a random location without looking
I chose one random word from each line of text and randomly chose two lines to skip as the text has twelve lines of words.
Put in google doc, highlighted 10 words spaced relatively equally throughout the paragraph, tried not to look at word length but probably accidentally selected a variety of lengths
and, to, also, plants, that, our, animals, when, this, vary
I randomly selected a line of 10 words in a row from the middle of the passage.
I used google docs' word count feature to find the number of words, and divided it by 10. Then I used a random number generator between 1 and 221 (the number of words) to choose my starting point, from which I circled every 22nd word until I had a sample of 10 words.
I choose some words that it seemed like a biologist would use frequently and then some words that are generally repeated in writing.
I took the total number of words in the passage, removing the attribution (221) and then used a google available random number generator to select individual words.
The 10 words I chose were ones that stood out to me as main points that were generally recurring in the paragraph
I chose them based upon apparent length (I did not actually count the letters). I found words that seemed to coincide with the most common length within the population of words.
I looked through the entire passage to see if I could spot any common words or word lengths. Then I took five words that were 4 characters or less from random areas in the passage, I then took another five words that were 5 characters or more from random spots in the passage. For the five shorter words I picked one word that was 2 characters, one with 3 characters, and 3 with 4 words. The longer words were picked more randomly and the length did not matter besides that they were longer than 5 characters.
I chose my 10 words based on the overarching concepts and theories Darwin is trying to get at. This excerpt seems to focus heavily on domesticated vs naturally grown plants and the variations that may ensue as a result of the different conditions in which they are raised. My chosen words reflect these ideas.
I chose the words that stuck out to me the most in the sentence or words that were repeated a lot.
I took the passage and put the total amount of words into a random number generator, then found the word that corresponded with that number. Rerolling if the same number was pulled twice.
every 15th word, then chosen randomly from 15 words
Mixture of random sampling and choosing words that commonly showed up
I copy-pasted the passage into Word to find out that it had 221 words. I then used a random number generator (I used ChatGPT, hopefully that is truly random) to pick 10 different numbers between 221. I then counted the letters from each of those 10 words where the first word of the paragraph is word 1 and the last word in the paragraph is word 221. The numbers I got were 10, 70, 196, 194, 5, 176, 28, 150, 43, and 145. I added the number of letters each of those words were and divided by 10 to get 5.2 letters on average.
I took 1 word from every 2 lines and that got me 7 words. The final 3 words were loosely based on thirds of the paper but I did not choose the same word at any point.
animals, plants, and, also, vary, when, this, that, still, species
I chose the words that I believe to have appeared the most or were most relevant to the theme/summary of his Origin of Species.
I chose a random string 10 words
I honestly did a random sample since it seemed from skimming the paragraph that there was a pretty good mix of numbers of letters per word.
I closed my eyes and pointed at ten random words in the article.
Random selection from ten different lines

Add on a “year” column to this year’s data

darwin <- darwin %>% mutate(year = "2024")

Compare to data from 2020, 2022, 2023

Read in data from past years

darwinold<-read.csv("data/Darwin2023.csv")

For some reason, the columns are arranged differently in the old and new data sets. We can rearrange the columns using the select function.

darwin <- darwin %>% select(Timestamp, mean.char, Sampling.method, year)
darwinold <- darwinold %>% select(Timestamp, mean.char, Sampling.method, year)
darwinall <- rbind(darwin, darwinold)
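The same idea can be sketched in base R on toy data: put the shared column names in one vector, index both data frames with it, and then stack them. The data frames `new_data` and `old_data` and all of their values are hypothetical. Converting `year` to character in both is an assumption made here so the two columns have matching types before they are combined.

```r
# Toy data frames with the same columns in different orders (made-up values)
new_data <- data.frame(
  Timestamp = "2024-09-05 09:30:00",
  Sampling.method = "random number generator",
  mean.char = 5.2,
  year = "2024"
)
old_data <- data.frame(
  year = c(2022, 2023),
  mean.char = c(6.1, 4.7),
  Timestamp = c("2022-09-08 10:00:00", "2023-09-07 10:15:00"),
  Sampling.method = c("picked key ideas", "every 22nd word")
)

# Make the year column the same type in both data frames
old_data$year <- as.character(old_data$year)

# One shared column order, applied to both data frames before stacking
keep <- c("Timestamp", "mean.char", "Sampling.method", "year")
all_years <- rbind(new_data[keep], old_data[keep])
all_years
```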

Create a multi-panel plot to illustrate the results from each year

gf_histogram(~mean.char | year, data=darwinall, xlab="Mean Number of Characters per Word") %>% 
  gf_vline(xintercept=~4.93)

## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_bin()`).

mean(~mean.char | year, data=darwinall, na.rm = TRUE)

##     2020     2022     2023     2024 
## 5.854107 5.607442 5.995349 6.246154

(mean.nchar<-mean(~nchar, data=oas))

## [1] 4.936652

Random sampling

With random sampling, however, we should get estimates that are, on average, equal to the population mean. Let's explore this by taking 10,000 random samples of 10 words and computing the mean number of characters per word in each sample!

randomsamps<-do(10000)*{
  samp.char<-sample(oas, 10)
  mean(~nchar, data=samp.char)
  # Alternative way to accomplish the same thing in 1 line of code
  #mean(~nchar, data=sample(oas, 10))
}    
gf_dhistogram(~result, data=randomsamps, xlab="Mean Number of Characters per Word",
          main="Random Sampling") %>% gf_vline(xintercept=~4.93)

mean(~result, data=randomsamps)

## [1] 4.94543
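For readers without the mosaic package, the same check can be sketched in base R with `replicate()`. The vector `word_lengths` below is a made-up population of 221 word lengths standing in for the passage (an assumption, not the real data), so the numbers will differ from those above, but the average of the sample means should still land close to the population mean.

```r
set.seed(2024)

# Hypothetical population of 221 word lengths (NOT the actual Darwin passage)
word_lengths <- rep(1:12, times = c(5, 35, 40, 35, 25, 20, 20, 15, 12, 8, 4, 2))

# Draw 10,000 simple random samples of 10 words (without replacement)
# and record the mean word length of each sample
sample_means <- replicate(10000, mean(sample(word_lengths, size = 10)))

# Compare the population mean with the average of the sample means
c(population = mean(word_lengths),
  average_of_sample_means = mean(sample_means))
```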

Larger samples -> less variable estimates!

Let's look at what sorts of estimates we would have gotten if we had sampled 50 words instead of 10. We find that our estimates would again be centered around the population mean, but there would be less variability in our estimates from sample to sample.

randomsamps<-do(10000)*{
  samp.char<-sample(oas, 50)
  mean(~nchar, data=samp.char)
  # Alternative way to accomplish the same thing in 1 line of code
  #mean(~nchar, data=sample(oas, 50))
}    
gf_dhistogram(~result, data=randomsamps, xlab="Mean Number of Characters per Word",
              main="Random Sampling") %>% gf_vline(xintercept=~4.93)

mean(~result, data=randomsamps)

## [1] 4.932008
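How much less variable? For a simple random sample of n words drawn without replacement from a population of N words, the standard error of the sample mean is sqrt(S^2 / n * (1 - n / N)), where S^2 is the population variance computed with an N - 1 denominator. The base R sketch below compares that formula with simulated sample means for n = 10 and n = 50. The vector `word_lengths` and the helper functions `se_srs()` and `sim_sd()` are hypothetical, introduced only for this illustration; the population is made up and is not the actual passage.

```r
set.seed(2024)

# Hypothetical population of 221 word lengths (NOT the actual Darwin passage)
word_lengths <- rep(1:12, times = c(5, 35, 40, 35, 25, 20, 20, 15, 12, 8, 4, 2))

# Theoretical standard error of the mean under simple random sampling
# without replacement (includes the finite population correction)
se_srs <- function(x, n) sqrt(var(x) / n * (1 - n / length(x)))

# Simulated standard deviation of the sample means
sim_sd <- function(x, n, reps = 10000) {
  sd(replicate(reps, mean(sample(x, size = n))))
}

results <- data.frame(
  n = c(10, 50),
  theoretical = c(se_srs(word_lengths, 10), se_srs(word_lengths, 50)),
  simulated = c(sim_sd(word_lengths, 10), sim_sd(word_lengths, 50))
)
results
```

The simulated and theoretical values should agree closely, and both are smaller for n = 50 than for n = 10.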

Write out the data for a future lab

write.csv(darwinall, file="data/Darwin2024.csv", row.names = FALSE)


John Fieberg

05 September, 2024
