Our data are scraped from a website that published timing data for a triathlon held in September 2013 in Westchester County. In total, 1089 athletes competed in 28 categories defined by age, gender, and number.
In the middle group of finishers, the times between consecutive runners are shorter than at either end of the field. Taken together, our categories of runners finishing the race look like this:
We focus our analysis on four categories: Males 45-49, Males 40-44, Males 50-54, and Collegiate Females. Within the larger arrival process, each category forms a smaller process that can be modeled independently of the others or compared with them.
| category | mean arrivals per minute |
|---|---|
| All 4 categories combined | 6.8272 |
| 45-49 Male | 1.9976 |
| 50-54 Male | 0.9356 |
| Collegiate Female | 1.4413 |
| 40-44 Male | 2.4527 |
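
The combined rate in the first row equals the sum of the four category rates, which is what we would expect if each category is treated as an independent Poisson process (an assumption on our part; the table itself does not state it). A minimal R sketch, using only the values in the table, checks the sum and converts each rate to a mean wait in seconds. The variable names are ours.

```r
# Rates copied from the table above (finishers per minute).
rates <- c("45-49 Male" = 1.9976, "50-54 Male" = 0.9356,
           "Collegiate Female" = 1.4413, "40-44 Male" = 2.4527)

# Independent Poisson processes superpose: their rates add.
combined_rate <- sum(rates)
combined_rate  # matches the 6.8272 reported for the 4 categories together

# Under an exponential wait-time model, the mean wait is 1 / rate.
# Dividing 60 by a per-minute rate gives the mean wait in seconds.
mean_wait_sec <- 60 / c(rates, combined = combined_rate)
round(mean_wait_sec, 1)
```
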
We now look at the wait times of our 4 processes, with a histogram for each. A more complicated, and likely more accurate, model could use a separate mean for each time segment within each category; the 4 graphs show 20 distinct possible means. We return to another way of modeling the mean at the end.
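
The idea of one mean per time segment can be sketched as a piecewise-constant rate. The R example below uses made-up finish times and five equal-width segments (20 means across 4 graphs suggests 5 per category, but that is our inference). `finish_min` and `breaks` are hypothetical names, not objects from the original analysis.

```r
# Hypothetical finish times for one category, in minutes after the first finisher.
finish_min <- c(2, 5, 9, 11, 12, 14, 15, 16, 18, 19, 21, 22, 24, 27, 33, 38)

# Five equal-width time segments (an assumed segmentation).
breaks <- seq(0, 40, by = 8)

# Count the finishers falling in each half-open segment [start, end).
counts <- table(cut(finish_min, breaks = breaks, right = FALSE))

# Piecewise-constant rate: finishers per minute within each segment.
rate_per_min <- as.numeric(counts) / diff(breaks)
rate_per_min
```

With real data, the middle segments would be expected to show higher rates than the first and last, matching the shorter between-runner times in the middle of the field.
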
| comparison | actual | expected | difference |
|---|---|---|---|
| 50-54 Male before 45-49 Male | 0.39800 | 0.31897 | 0.07903 |
| Collegiate Female before 45-49 Male | 0.42750 | 0.41912 | 0.00838 |
| 40-44 Male before 45-49 Male | 0.64175 | 0.55114 | 0.09061 |
| Collegiate Female before 50-54 Male | 0.60025 | 0.60638 | -0.00613 |
| 40-44 Male before 50-54 Male | 0.75525 | 0.72388 | 0.03137 |
| 40-44 Male before Collegiate Female | 0.70225 | 0.62987 | 0.07238 |
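Our estimates would have predicted fairly well which category's runner arrives first: across the six comparisons, the actual and expected probabilities differ by at most about 0.09. The fit is not perfect, but it roughly holds. If we used a heterogeneous (nonhomogeneous) model, in which the rate changes over time, instead of a homogeneous one, we might be able to improve our prediction.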
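
The expected column is consistent with the standard result for two independent Poisson processes: the probability that the next arrival comes from process A rather than process B is \(\lambda_A / (\lambda_A + \lambda_B)\). A short R sketch, assuming the per-minute rates from the first table, reproduces those values to within the rounding of the inputs. The function name `p_first` is ours, for illustration only.

```r
# Rates copied from the first table (finishers per minute).
rates <- c("45-49 Male" = 1.9976, "50-54 Male" = 0.9356,
           "Collegiate Female" = 1.4413, "40-44 Male" = 2.4527)

# Probability that category a produces the next arrival before category b.
p_first <- function(a, b) rates[[a]] / (rates[[a]] + rates[[b]])

# The six comparisons, in the same order as the table.
pairs <- list(c("50-54 Male", "45-49 Male"),
              c("Collegiate Female", "45-49 Male"),
              c("40-44 Male", "45-49 Male"),
              c("Collegiate Female", "50-54 Male"),
              c("40-44 Male", "50-54 Male"),
              c("40-44 Male", "Collegiate Female"))

expected <- sapply(pairs, function(p) p_first(p[1], p[2]))
names(expected) <- sapply(pairs, paste, collapse = " before ")
round(expected, 5)
```
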
Our mean number of arrivals may follow a gamma distribution. When the rate \(\lambda\) of a Poisson distribution itself varies according to a gamma distribution, the resulting count distribution is a negative binomial distribution. The negative binomial distribution can also be used to calculate the probability that a set number of events of one type will happen before a set number of events of another type.
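
Both uses of the negative binomial can be checked numerically in base R. The sketch below uses illustrative gamma parameters (`shape` and `rate` are assumed values, not estimates from the race data) and two helper functions, `mixture_pmf` and `p_n_before_m`, that we introduce only for this example.

```r
# Illustrative gamma parameters (assumed, not fitted to the race data).
shape <- 3
rate  <- 1.5

# Poisson pmf averaged over a gamma-distributed lambda, by numerical integration.
mixture_pmf <- function(k) {
  integrate(function(lambda) dpois(k, lambda) * dgamma(lambda, shape = shape, rate = rate),
            lower = 0, upper = Inf)$value
}
k <- 0:10
mixed <- sapply(k, mixture_pmf)

# The matching negative binomial has size = shape and prob = rate / (rate + 1).
nb <- dnbinom(k, size = shape, prob = rate / (rate + 1))
isTRUE(all.equal(mixed, nb, tolerance = 1e-4))

# Probability that n arrivals of type A occur before m arrivals of type B,
# where p is the chance that any single arrival is of type A.
# pnbinom counts the type-B arrivals seen before the n-th type-A arrival.
p_n_before_m <- function(n, m, p) pnbinom(m - 1, size = n, prob = p)
p_n_before_m(1, 1, 0.5)  # with n = m = 1 this reduces to p
```

For two of our categories, `p` would be \(\lambda_A / (\lambda_A + \lambda_B)\), the same ratio used for the expected column in the comparison table.
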
The following is preserved for future analysis. After the first set of numbers proved especially useful, we concentrated on the runner timing analysis:
#------------------------------------------------------------------------------------------------
# Data was scraped from the procon.org website. We looked at three controversial topics and
# created corpora to find the most frequent words in each set of opinions. We could count the
# occurrence and frequency of a word among opinions and look at the likelihood of one word
# over another. We could look at the arrival of a word within a sequence of words in the same
# way as we might look at the arrival of a person.
#------------------------------------------------------------------------------------------------
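# Packages used by the code below.
library(stringr)
library(tm)
library(stringi)
# Read the raw HTML of the page, one element per line.
procon.marijuana<-unlist(readLines('https://medicalmarijuana.procon.org/view.answers.php?questionID=001325'))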
procon.marijuana.pro<-procon.marijuana[c(1202:1607)]
procon.marijuana.con<-procon.marijuana[c(1619:1972)]
procon.marijuana.con<-str_remove_all (procon.marijuana.con,"<[^>]+>")
procon.marijuana.pro<-str_remove_all (procon.marijuana.pro,"<[^>]+>")
#------------------------------------------------------------------------------------------------
procon.marijuana.pro<-(tolower(procon.marijuana.pro))
marijuana_pro_corpus<-Corpus(VectorSource(procon.marijuana.pro))
marijuana_pro_corpus <- tm_map(marijuana_pro_corpus, removeWords, stopwords("english"))
marijuana_pro_tdm<-TermDocumentMatrix(marijuana_pro_corpus)
marijuana_pro_top_terms<-findFreqTerms(marijuana_pro_tdm, lowfreq=15, highfreq=Inf)
marijuana_pro_top_terms
procon.marijuana.con<-(tolower(procon.marijuana.con))
marijuana_con_corpus<-Corpus(VectorSource(procon.marijuana.con))
marijuana_con_corpus <- tm_map(marijuana_con_corpus, removeWords, stopwords("english"))
marijuana_con_tdm<-TermDocumentMatrix(marijuana_con_corpus)
marijuana_con_top_terms<-findFreqTerms(marijuana_con_tdm, lowfreq=15, highfreq=Inf)
marijuana_con_top_terms
#------------------------------------------------------------------------------------------------
procon.corporatetax<-unlist(readLines('https://corporatetax.procon.org'))
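# Hard-coded line ranges for the pro and con sections of the page.
# Note: the two ranges share lines 1366-1368.
procon.corporatetax.pro<-procon.corporatetax[c(1204:1368)]
procon.corporatetax.con<-procon.corporatetax[c(1366:1435)]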
procon.corporatetax.con<-str_remove_all (procon.corporatetax.con,"<[^>]+>")
procon.corporatetax.pro<-str_remove_all (procon.corporatetax.pro,"<[^>]+>")
procon.corporatetax.pro<-(tolower(procon.corporatetax.pro))
corporatetax_pro_corpus<-Corpus(VectorSource(procon.corporatetax.pro))
corporatetax_pro_corpus <- tm_map(corporatetax_pro_corpus, removeWords, stopwords("english"))
corporatetax_pro_tdm<-TermDocumentMatrix(corporatetax_pro_corpus)
corporatetax_pro_top_terms<-findFreqTerms(corporatetax_pro_tdm, lowfreq=15, highfreq=Inf)
corporatetax_pro_top_terms
procon.corporatetax.con<-(tolower(procon.corporatetax.con))
corporatetax_con_corpus<-Corpus(VectorSource(procon.corporatetax.con))
corporatetax_con_corpus <- tm_map(corporatetax_con_corpus, removeWords, stopwords("english"))
corporatetax_con_tdm<-TermDocumentMatrix(corporatetax_con_corpus)
corporatetax_con_top_terms<-findFreqTerms(corporatetax_con_tdm, lowfreq=10, highfreq=Inf)
corporatetax_con_top_terms
#------------------------------------------------------------------------------------------------
procon.gold<-unlist(readLines('https://gold-standard.procon.org/'))
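# Hard-coded line ranges for the pro and con sections of the page.
# Note: the two ranges share lines 1272-1274.
procon.gold.pro<-procon.gold[c(1204:1274)]
procon.gold.con<-procon.gold[c(1272:1344)]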
procon.gold.con<-str_remove_all (procon.gold.con,"<[^>]+>")
procon.gold.pro<-str_remove_all (procon.gold.pro,"<[^>]+>")
procon.gold.pro<-(tolower(procon.gold.pro))
gold_pro_corpus<-Corpus(VectorSource(procon.gold.pro))
gold_pro_corpus <- tm_map(gold_pro_corpus, removeWords, stopwords("english"))
gold_pro_tdm<-TermDocumentMatrix(gold_pro_corpus)
gold_pro_top_terms<-findFreqTerms(gold_pro_tdm, lowfreq=10, highfreq=Inf)
gold_pro_top_terms
procon.gold.con<-(tolower(procon.gold.con))
gold_con_corpus<-Corpus(VectorSource(procon.gold.con))
gold_con_corpus <- tm_map(gold_con_corpus, removeWords, stopwords("english"))
gold_con_tdm<-TermDocumentMatrix(gold_con_corpus)
gold_con_top_terms<-findFreqTerms(gold_con_tdm, lowfreq=10, highfreq=Inf)
gold_con_top_terms
#------------------------------------------------------------------------------------------------
procon.gold.pro<-unlist(procon.gold.pro)
procon.gold.con<-unlist(procon.gold.con)
joint.corpus<-c(procon.gold.pro,procon.gold.con, recursive = FALSE)
joint.corpus<-Corpus(VectorSource(joint.corpus))
joint.corpus <- tm_map(joint.corpus, removeWords, stopwords("english"))
joint.corpus_tdm<-TermDocumentMatrix(joint.corpus)
joint.corpus_terms<-findFreqTerms(joint.corpus_tdm, lowfreq=10, highfreq=Inf)
joint.corpus_terms
str_locate_all(procon.gold.pro,"debt")
procon.gold.pro[28]
procon.gold.pro[23]
procon.gold.pro[24]
procon.gold.pro[25]
# Combine three selected passages into a single string.
procon.gold.pro<-paste(procon.gold.pro[23],procon.gold.pro[28],procon.gold.pro[14])
procon.gold.pro
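
The idea noted at the top of this script, treating the arrival of a word within a sequence of words like the arrival of a person, can be sketched in base R. The example below uses a made-up sentence rather than the scraped text; `text`, `tokens`, and the target word are hypothetical stand-ins.

```r
# Hypothetical text; in practice this would be one of the cleaned passages above.
text <- "gold backs the dollar and gold limits debt because gold is scarce"

# Split into lowercase word tokens and drop any empty strings.
tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
tokens <- tokens[tokens != ""]

# Positions at which the target word "arrives" in the token sequence.
positions <- which(tokens == "gold")

# Gaps between consecutive arrivals, measured in words (analogous to wait times).
gaps <- diff(positions)

# Estimated arrival rate of the word per token.
rate_per_word <- length(positions) / length(tokens)

gaps
rate_per_word
```

The gaps play the role of the between-runner wait times, so the same rate and wait-time summaries used for the runners could in principle be applied to them.
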