Poisson Processes and Real-Life Data

A Poisson process is a counting process in which events occur at random and the numbers of events in non-overlapping time intervals are independent. The number of events in an interval can be modeled by a Poisson distribution, and the length of time between events can be modeled by an exponential distribution. A Poisson process is homogeneous if its rate \(\lambda\) is constant over time, so that the expected number of events in an interval depends only on the interval's length.
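The link between exponential gaps and Poisson counts can be checked with a minimal simulation sketch. The rate `lambda` and the window `t_max` below are illustrative values chosen for the example, not estimates from the race data.

```r
# Simulate a homogeneous Poisson process on [0, t_max].
# lambda (events per minute) and t_max are illustrative assumptions.
set.seed(1)
lambda <- 2
t_max  <- 500

# Exponential gaps between events; draw more than needed, then truncate.
gaps   <- rexp(ceiling(2 * lambda * t_max), rate = lambda)
events <- cumsum(gaps)
events <- events[events <= t_max]

# Count the events that fall in each one-minute bin.
counts <- tabulate(floor(events) + 1, nbins = t_max)

mean(counts)         # close to lambda
var(counts)          # for Poisson counts the variance is also close to lambda
mean(diff(events))   # mean gap, close to 1 / lambda
```

If the per-minute mean and variance are far apart in real data, that is a first hint that the rate is not constant.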

A Poisson process has three properties that we intend to investigate: thinning, memorylessness, and its use in predicting the order of events.

First, we have to find an area of our process where the mean is relatively stable. We can see from the graph below (left) that the mean number of runners finishing in a given minute is not very stable. It starts small, increases rapidly, then decreases rapidly, and ends in a long, spread-out tail. We will revisit this shape at the end. For now, we look for a section of data in the center where we can find a relatively steady stream of finishers.

Our data was scraped from a website with timing data from a triathlon held in September 2013 in Westchester County. A total of 1089 athletes competed in 28 categories of age, gender, and number.

When we look at the time spent waiting for each subsequent finisher, we can see that the race exhibits different behavior at both ends. Some of the earlier finishers were quite spread out. The last finishers were even more spread out.

To get a better look at the shape of the wait time between runners, we look only at values below 0.6 minutes. We want to see if the outliers at the ends are part of a trend that runs through our data. We can see that, even where it’s relatively stable, the wait times increase the farther a finisher is from the median. With 1089 runners, the median was the 545th runner.

We now turn our attention to the thinning of our process. When we thin a Poisson process, we divide it into independent sub-processes for analysis. If a fraction \(p\) of the events belongs to one group, that group is itself a Poisson process with rate \(p\lambda\), so the mean number of runners per minute, the Poisson \(\lambda\), splits into proportions of the whole.

To determine which groups and which interval to focus on, we look at the individual categories of runners who were tracked for this race. We want to find those with potentially stable wait times and with a decent number of runners. We also want a category where the runner finishes were spread out throughout the duration of the race.

In the middle group of finishers, between-runner times are smaller than at both ends. Taken together, our categories of runners finishing our race look like this:

To find a relatively homogeneous process, we eliminate the first 200 runners and the last 289 runners from our analysis, leaving the middle 600. This is a good example of the difference between most real-life Poisson processes and the simplified processes used for study. Processes can take time to get started and can change when there are few individuals left. A common study problem, waiting at a bank, exhibits this quality. Banks have starts and ends to their day and have periods when they are busier. An insurance line, however, if it tends to have a similar pool from year to year, could remain stable.
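The trimming step can be sketched as follows. The vector `finish_min` is a hypothetical stand-in for the scraped finish times (in minutes) and is simulated here only so that the example runs; the numbers it produces say nothing about the real race.

```r
# finish_min: hypothetical finish times in minutes, one per athlete.
# Simulated placeholder data, not the scraped race results.
set.seed(2)
finish_min <- sort(rgamma(1089, shape = 20, rate = 0.2))

n      <- length(finish_min)
middle <- finish_min[201:(n - 289)]   # drop the first 200 and last 289 finishers

waits <- diff(middle)   # wait between consecutive finishers in the window
length(middle)          # 600 finishers remain
mean(waits)             # average wait in the trimmed window
1 / mean(waits)         # implied finishers per minute
```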

We choose Males 45-49, Males 40-44, Males 50-54 and Collegiate Females to focus our analysis. Within our larger process, there are smaller processes that can be modeled independently from each other or compared to each other.

All Runner Mean and Selected Lambda Values
Type                 Mean per minute
4 categories         6.8272
45-49 Male           1.9976
50-54 Male           0.9356
Collegiate Female    1.4413
40-44 Male           2.4527
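The four category means add up to the combined mean, which is what thinning predicts. The sketch below checks the arithmetic from the table and then simulates a combined process whose events are labeled at random in those proportions. The window `t_max` and the random labeling are assumptions made for illustration, not the method used on the race data.

```r
# Per-minute means copied from the table above.
lambda <- c("45-49 Male" = 1.9976, "50-54 Male" = 0.9356,
            "Collegiate Female" = 1.4413, "40-44 Male" = 2.4527)

total <- sum(lambda)      # 6.8272, the "4 categories" row
share <- lambda / total   # proportion of the combined process in each group
round(share, 4)

# Thinning by simulation: label each event of the combined process
# independently with probability equal to its group's share.
set.seed(3)
t_max  <- 1000                               # illustrative window, in minutes
n_evt  <- rpois(1, total * t_max)
labels <- sample(names(lambda), n_evt, replace = TRUE, prob = share)
est    <- table(labels)[names(lambda)] / t_max

round(cbind(table_value = lambda, simulated = as.numeric(est)), 3)
```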

Now we look at the wait times for our selected sets of runners. Our between-runner times are now interpreted as within-class times: each time is the wait for the next runner of the same group. Our times are not fully stable, but they are far more stable than our beginning data. We can now look at the memoryless property and the expected type of the next arrival.

Poisson processes exhibit the memoryless property. In a Poisson process, the expected remaining waiting time is the same no matter how long you’ve already waited. If you take the mean excess loss (the mean amount by which values above a threshold exceed that threshold), it will equal the mean of the full distribution. Suppose the line at the grocery store has a mean wait of 5 minutes. If you’ve already waited 5 minutes, your expected remaining wait is still 5 minutes. If you’ve waited 10 minutes, your mean future wait time is still 5 minutes. If the process we’ve witnessed shows the memoryless property, the graph below should stay close to 0.57.
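A minimal sketch of the mean excess calculation on simulated exponential waits follows. It treats the 0.57 quoted above as the mean wait in minutes, which is an assumption about what the graph plots. The helper `mean_excess` and the thresholds are hypothetical names and values introduced for the example.

```r
set.seed(4)
mean_wait <- 0.57                              # assumed mean wait, in minutes
waits <- rexp(100000, rate = 1 / mean_wait)    # simulated, not race data

# Mean excess at threshold d: average of (wait - d) over waits longer than d.
mean_excess <- function(x, d) mean(x[x > d] - d)

thresholds <- c(0, 0.25, 0.5, 1, 1.5)
excess <- sapply(thresholds, function(d) mean_excess(waits, d))
round(excess, 3)   # each value should sit near mean_wait
```

For a truly exponential wait the values stay flat across thresholds. A curve that drifts upward or downward suggests the rate is changing over time.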

Our process exhibits a quality similar to the memoryless property. However, the difference is a little greater than we would expect. This is likely because our real-life process has a changing mean, as seen in our first graph (at the top).

We take a look at the wait times of our 4 processes, along with a histogram for each. A more complicated and more accurate model could take the mean of each time segment for each category. We can see 20 distinct possible means in the 4 graphs. We will look at another way to model the mean at the end.

Wait Time within Same Group

For two independent Poisson processes with rates \(\lambda_{a}\) and \(\lambda_{b}\), the probability that an arrival of type a occurs before one of type b is \(\frac{\lambda_{a}}{\lambda_{a}+\lambda_{b}}\). We simulate 100 sets of 40 random times, 4000 in all, as if 4000 spectators showed up at random moments during the race. From each time, we determine which type of runner comes next. What is the probability that the next runner they see is of one type rather than another?

Who is likely to come first?
Comparison                              Actual    Expected  Difference
50-54 Male before 45-49 Male            0.39800   0.31897    0.07903
Collegiate Female before 45-49 Male     0.42750   0.41912    0.00838
40-44 Male before 45-49 Male            0.64175   0.55114    0.09061
Collegiate Female before 50-54 Male     0.60025   0.60638   -0.00613
40-44 Male before 50-54 Male            0.75525   0.72388    0.03137
40-44 Male before Collegiate Female     0.70225   0.62987    0.07238
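The expected column follows directly from the per-minute means listed earlier. The sketch below reproduces one entry and then checks it against idealized exponential waits. This is a simplification: the actual column came from random times laid over the real finish data, not from simulated exponentials. The helper `p_first` is a hypothetical name introduced here.

```r
# Per-minute means from the lambda table.
lambda <- c("45-49 Male" = 1.9976, "50-54 Male" = 0.9356,
            "Collegiate Female" = 1.4413, "40-44 Male" = 2.4527)

# Probability that the next arrival of type a comes before the next of type b.
p_first <- function(a, b) unname(lambda[a] / (lambda[a] + lambda[b]))

expected <- p_first("50-54 Male", "45-49 Male")
round(expected, 5)   # 0.31897, matching the first row of the table

# Idealized check: 4000 spectators, each waiting for both types.
set.seed(5)
n      <- 4000
wait_a <- rexp(n, rate = lambda["50-54 Male"])
wait_b <- rexp(n, rate = lambda["45-49 Male"])
simulated <- mean(wait_a < wait_b)
simulated            # close to the expected value
```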

Our estimates would have predicted fairly well which runner would arrive next. The fit is not perfect, but it roughly holds: the largest gap between actual and expected is about 0.09. If we used a nonhomogeneous model, in which the rate changes over time, instead of a homogeneous one, we might be able to improve our prediction.

Finally, we investigate the shape of our model. We were able to see the memoryless property. We were able to thin a larger Poisson process into a set of smaller processes. We made reasonable predictions about the likely type of runner to arrive next. The shape of our runner arrivals from the original graph is reproduced again, along with a representative gamma distribution.

Our mean arrival count appears to follow a shape close to a gamma distribution. When the \(\lambda\) of a Poisson distribution itself varies according to a gamma distribution, the resulting count distribution is a negative binomial distribution. The negative binomial distribution can also be used to calculate the probability that a set number of events of one type will happen before a given number of another type.
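As a sketch of why the mixture gives a negative binomial, suppose that \(\lambda\) has a gamma distribution with shape \(r\) and rate \(\beta\), and that the count \(N\) given \(\lambda\) is Poisson with mean \(\lambda\). These parameters are assumptions for illustration and are not estimated from the race. Integrating out \(\lambda\):

\[
P(N=k)=\int_{0}^{\infty}\frac{\lambda^{k}e^{-\lambda}}{k!}\cdot\frac{\beta^{r}\lambda^{r-1}e^{-\beta\lambda}}{\Gamma(r)}\,d\lambda
=\frac{\beta^{r}}{k!\,\Gamma(r)}\cdot\frac{\Gamma(k+r)}{(\beta+1)^{k+r}}
=\frac{\Gamma(k+r)}{k!\,\Gamma(r)}\left(\frac{\beta}{\beta+1}\right)^{r}\left(\frac{1}{\beta+1}\right)^{k}
\]

This is the negative binomial probability with size \(r\) and success probability \(\frac{\beta}{\beta+1}\). Its mean is \(\frac{r}{\beta}\) and its variance is \(\frac{r}{\beta}+\frac{r}{\beta^{2}}\), so the counts are more spread out than a Poisson distribution with the same mean.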

The following is preserved for future analysis. After the first set of numbers proved especially useful, we concentrated on the runner timing analysis:

#------------------------------------------------------------------------------------------------
# Data was scraped from the procon.org website. We looked at three controversial topics and
# created corpora to find the most frequent words in each set of opinions. We could count the
# occurrence and frequency of a word among opinions and look at the likelihood of one word over
# another. We could look at the arrival of a word among other words in the same way as we might
# look at the arrival of a person.
# Note: the hard-coded line ranges below depend on the page layout at the time of scraping.
#------------------------------------------------------------------------------------------------
library(stringr)  # str_remove_all, str_locate_all
library(tm)       # Corpus, VectorSource, tm_map, TermDocumentMatrix, findFreqTerms
library(stringi)  # stri_join
procon.marijuana<-unlist(readLines('https://medicalmarijuana.procon.org/view.answers.php?questionID=001325'))
procon.marijuana.pro<-procon.marijuana[c(1202:1607)]
procon.marijuana.con<-procon.marijuana[c(1619:1972)]
procon.marijuana.con<-str_remove_all (procon.marijuana.con,"<[^>]+>")
procon.marijuana.pro<-str_remove_all (procon.marijuana.pro,"<[^>]+>")
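#------------------------------------------------------------------------------------------------
# Marijuana: lower-case each side, build a corpus, drop English stop words,
# and list the terms that appear at least 15 times.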
procon.marijuana.pro<-(tolower(procon.marijuana.pro))
marijuana_pro_corpus<-Corpus(VectorSource(procon.marijuana.pro))
marijuana_pro_corpus <- tm_map(marijuana_pro_corpus, removeWords, stopwords("english"))
marijuana_pro_tdm<-TermDocumentMatrix(marijuana_pro_corpus)
marijuana_pro_top_terms<-findFreqTerms(marijuana_pro_tdm, lowfreq=15, highfreq=Inf)
marijuana_pro_top_terms
procon.marijuana.con<-(tolower(procon.marijuana.con))
marijuana_con_corpus<-Corpus(VectorSource(procon.marijuana.con))
marijuana_con_corpus <- tm_map(marijuana_con_corpus, removeWords, stopwords("english"))
marijuana_con_tdm<-TermDocumentMatrix(marijuana_con_corpus)
marijuana_con_top_terms<-findFreqTerms(marijuana_con_tdm, lowfreq=15, highfreq=Inf)
marijuana_con_top_terms
#------------------------------------------------------------------------------------------------
# Corporate tax: same pipeline (the con side uses a lower frequency cutoff of 10).
procon.corporatetax<-unlist(readLines('https://corporatetax.procon.org'))
procon.corporatetax.pro<-procon.corporatetax[c(1204:1368)]
procon.corporatetax.con<-procon.corporatetax[c(1366:1435)]
procon.corporatetax.con<-str_remove_all (procon.corporatetax.con,"<[^>]+>")
procon.corporatetax.pro<-str_remove_all (procon.corporatetax.pro,"<[^>]+>")
procon.corporatetax.pro<-(tolower(procon.corporatetax.pro))
corporatetax_pro_corpus<-Corpus(VectorSource(procon.corporatetax.pro))
corporatetax_pro_corpus <- tm_map(corporatetax_pro_corpus, removeWords, stopwords("english"))
corporatetax_pro_tdm<-TermDocumentMatrix(corporatetax_pro_corpus)
corporatetax_pro_top_terms<-findFreqTerms(corporatetax_pro_tdm, lowfreq=15, highfreq=Inf)
corporatetax_pro_top_terms
procon.corporatetax.con<-(tolower(procon.corporatetax.con))
corporatetax_con_corpus<-Corpus(VectorSource(procon.corporatetax.con))
corporatetax_con_corpus <- tm_map(corporatetax_con_corpus, removeWords, stopwords("english"))
corporatetax_con_tdm<-TermDocumentMatrix(corporatetax_con_corpus)
corporatetax_con_top_terms<-findFreqTerms(corporatetax_con_tdm, lowfreq=10, highfreq=Inf)
corporatetax_con_top_terms
#------------------------------------------------------------------------------------------------
# Gold standard: same pipeline with a frequency cutoff of 10 on both sides.
procon.gold<-unlist(readLines('https://gold-standard.procon.org/'))
procon.gold.pro<-procon.gold[c(1204:1274)]
procon.gold.con<-procon.gold[c(1272:1344)]
procon.gold.con<-str_remove_all (procon.gold.con,"<[^>]+>")
procon.gold.pro<-str_remove_all (procon.gold.pro,"<[^>]+>")
procon.gold.pro<-(tolower(procon.gold.pro))
gold_pro_corpus<-Corpus(VectorSource(procon.gold.pro))
gold_pro_corpus <- tm_map(gold_pro_corpus, removeWords, stopwords("english"))
gold_pro_tdm<-TermDocumentMatrix(gold_pro_corpus)
gold_pro_top_terms<-findFreqTerms(gold_pro_tdm, lowfreq=10, highfreq=Inf)
gold_pro_top_terms
procon.gold.con<-(tolower(procon.gold.con))
gold_con_corpus<-Corpus(VectorSource(procon.gold.con))
gold_con_corpus <- tm_map(gold_con_corpus, removeWords, stopwords("english"))
gold_con_tdm<-TermDocumentMatrix(gold_con_corpus)
gold_con_top_terms<-findFreqTerms(gold_con_tdm, lowfreq=10, highfreq=Inf)
gold_con_top_terms
#------------------------------------------------------------------------------------------------
# Gold standard: pro and con text combined into one corpus.
procon.gold.pro<-unlist(procon.gold.pro)
procon.gold.con<-unlist(procon.gold.con)
joint.corpus<-c(procon.gold.pro,procon.gold.con, recursive = FALSE)
joint.corpus<-Corpus(VectorSource(joint.corpus))
joint.corpus <- tm_map(joint.corpus, removeWords, stopwords("english"))
joint.corpus_tdm<-TermDocumentMatrix(joint.corpus)
joint.corpus_terms<-findFreqTerms(joint.corpus_tdm, lowfreq=10, highfreq=Inf)
joint.corpus_terms
# Locate "debt" in the pro text, then inspect selected lines.
str_locate_all(procon.gold.pro,"debt")
procon.gold.pro[28]
procon.gold.pro[23]
procon.gold.pro[24]
procon.gold.pro[25]
procon.gold.pro<- stri_join(procon.gold.pro, sep = " ", collapse = NULL)
# Combine three selected lines into a single string.
procon.gold.pro<-paste(procon.gold.pro[23],procon.gold.pro[28],procon.gold.pro[14])
procon.gold.pro