My name is Phoebe and I drink a lot of tea! In 2025 I decided to track my hot beverage consumption. In this report, I explore how my tea-drinking habits vary by day of the week, and whether any underlying patterns explain my beverage preferences, using a Naive Bayes predictor.

I began by exploring my data to see whether there were any visible patterns: where I drank the most tea and which types of tea I had most often. For Type of Beverage and Location I kept only values that appear more than once (n > 1) to make the plots easier to read. I am a worldly woman and there are a lot of teas I have tried once but never again, so I removed those from my plots for visual purposes.

I filtered my dataset before using the algorithm to make the predictions a little easier. I removed coffee, hot chocolate, and other beverages with variety “N/A” so that I could use variety as a predictor. I also converted variety, beverage, temperature, caffeine content, day, and month to factors so they could be used in the model.

hotbev_filtered <- hotbev |>
  filter(variety != "N/A")

# All of my variables are qualitative, so convert them to factors
hotbev_filtered$variety <- as.factor(hotbev_filtered$variety)

hotbev_filtered$temp <- as.factor(hotbev_filtered$temp)

hotbev_filtered$beverage <- as.factor(hotbev_filtered$beverage)

hotbev_filtered$caffeine_content <- as.factor(hotbev_filtered$caffeine_content)

hotbev_filtered$month <- as.factor(hotbev_filtered$month)

hotbev_filtered$day <- as.factor(hotbev_filtered$day)

Naive Bayes

I decided to use a Naive Bayes algorithm to predict day of the week based on beverage, variety, temperature, and caffeine content. I chose not to use month because months vary in length and because I only collected data for 7 days in May.

Naive Bayes is a classification algorithm based on Bayes’ Theorem, a probability theorem. A classification algorithm assigns each entry to one of a finite number of classes. Here, my classes are days of the week, and I want to assign a day to each time I had a hot beverage. Naive Bayes rests on an unrealistic assumption of independence between predictors: within each class, it treats every predictor as unrelated to the others when making its predictions. Here, for instance, it views beverage variety and caffeine content as totally independent of each other, even though the variety “Herbal” always has a caffeine content of “Decaf.”
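
To make the independence assumption concrete, here is a minimal sketch that does the Naive Bayes arithmetic by hand in base R. The `toy` data frame is made up for illustration (it is not my hot beverage log), and the names `prior`, `p_variety`, and `p_temp` are only for this example.

```r
# Toy data, invented for illustration only (not from the hot beverage log)
toy <- data.frame(
  day     = c("Mon", "Mon", "Mon", "Thu", "Thu", "Thu", "Thu"),
  variety = c("Black", "Black", "Herbal", "Herbal", "Herbal", "Green", "Herbal"),
  temp    = c("Hot", "Hot", "Iced", "Hot", "Hot", "Hot", "Iced")
)

# Prior: P(day), estimated from how often each day appears
prior <- prop.table(table(toy$day))

# Likelihoods: P(variety | day) and P(temp | day), one row per day
p_variety <- prop.table(table(toy$day, toy$variety), margin = 1)
p_temp    <- prop.table(table(toy$day, toy$temp), margin = 1)

# Score a new cup (Herbal, Hot). The "naive" step is multiplying the two
# likelihoods as if variety and temp were independent within each day.
unnormalised <- prior * p_variety[, "Herbal"] * p_temp[, "Hot"]
posterior <- unnormalised / sum(unnormalised)
round(posterior, 3)
```

With this toy data the posterior works out to 8/35 for Monday and 27/35 for Thursday, so the sketch would label a hot herbal cup as a Thursday cup.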

While its assumption of independence often causes Naive Bayes to perform worse than random forests and other classification models, an advantage is that it needs only a small amount of data. Since I have been collecting my data manually, and I don’t drink one million cups of tea per day, my dataset is small. Therefore, I needed a model that could make predictions from a small amount of data, which Naive Bayes is able to do. It is an algorithm that decreases variance but introduces bias, and it works quite well in practice.

Methods

Once I decided to use Naive Bayes, I split my data in half. One half, the training data, is used to “train” the model, where it can observe existing patterns between days of the week and caffeine content, variety, etcetera. The other half, the testing data, is fed into the fitted model to see how well it predicts. I fit a Naive Bayes model called tea_fit, then stored its predictions for the testing data in a vector called tea_preds. I then compared these predictions to the actual days in the testing data, which come from my hotbev_filtered data frame. This also let me measure the overall accuracy of my predictions, which was usually around 12-13%.

However, when I re-ran the split and the model, the accuracy would range from 9% to 17%. To find a more consistent accuracy for my model, I decided to “bootstrap” my process (strictly speaking this is repeated random splitting, since a true bootstrap resamples with replacement). Instead of manually re-splitting my data and making predictions a million times, then recording the accuracies and averaging them by hand, I wrote a loop that runs my original code 1000 times and produces 1000 accuracies. I stored these in a vector called “accuracy” and averaged them to find out what a long-term accuracy score for my model would be. It ended up being around 14.4%, meaning that my model correctly predicts what day it is based on my beverage choice, temperature, variety, and caffeine content 14.4% of the time.

library(e1071)  # assumed source of naiveBayes(); the package loaded is not shown in this report

# Randomly assign half of the rows to training; the remaining rows become the test set
training_data_rows <- sample(1:nrow(hotbev_filtered),
                             size = nrow(hotbev_filtered)/2)
tea_training <- hotbev_filtered[training_data_rows, ]
tea_testing <- hotbev_filtered[-training_data_rows, ]

# When I added date as a predictor, accuracy increased to about 57%, which is
# interesting because in theory shouldn't every 7th date make day predictable??
tea_fit <- naiveBayes(day ~ caffeine_content + variety + beverage + temp, data = tea_training)

tea_preds <- predict(tea_fit, tea_testing)

confusion_matrix <- table(tea_preds, tea_testing$day)
print(confusion_matrix)
##            
## tea_preds   Friday Monday Saturday Sunday Thursday Tuesday Wednesday
##   Friday         2      3        5      1        4       2         2
##   Monday         2      4        2      1        3       5         2
##   Saturday       2      2        0      2        2       1         0
##   Sunday         6      2        9      7        7       3         6
##   Thursday       5      1        5      7        5       6         5
##   Tuesday        6      5        5      2        4       5         8
##   Wednesday      2      2        0      0        1       1         1
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(paste("Accuracy:", round(accuracy, 2)))
## [1] "Accuracy: 0.15"
# Repeat the random 50/50 split B times and record each test-set accuracy
set.seed(24)
B <- 1000
accuracy <- rep(NA, B)

for (b in 1:B) {
  training_data_rows <- sample(1:nrow(hotbev_filtered),
                               size = nrow(hotbev_filtered)/2)
  tea_training <- hotbev_filtered[training_data_rows, ]
  tea_testing <- hotbev_filtered[-training_data_rows, ]
  tea_fit <- naiveBayes(day ~ caffeine_content + variety + beverage + temp, data = tea_training)
  tea_preds <- predict(tea_fit, tea_testing)
  confusion_matrix <- table(tea_preds, tea_testing$day)
  accuracy[b] <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
}

avg_accuracy <- mean(accuracy)
avg_accuracy
## [1] 0.1443313
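
The loop above reports only the mean, but the spread of the accuracies is informative too. As a rough sketch, the same repeated-split idea can be wrapped in a function and summarised with a standard deviation and quantiles. Everything here is an assumption for illustration: `repeated_split_accuracy` and `majority_class` are hypothetical helpers, the built-in `iris` data stands in for my beverage log, and the "model" simply predicts the most common class so that the example runs without any extra packages.

```r
# Hypothetical helper: repeat a random 50/50 split B times and return the
# test-set accuracy from each repetition.
repeated_split_accuracy <- function(data, target, fit_predict, B = 200, seed = 24) {
  set.seed(seed)
  n <- nrow(data)
  vapply(seq_len(B), function(b) {
    train_rows <- sample(seq_len(n), size = floor(n / 2))
    train <- data[train_rows, , drop = FALSE]
    test  <- data[-train_rows, , drop = FALSE]
    preds <- fit_predict(train, test)
    # Share of test rows whose predicted class matches the true class
    mean(as.character(preds) == as.character(test[[target]]))
  }, numeric(1))
}

# Stand-in "model": always predict the most common class in the training half
majority_class <- function(train, test) {
  top <- names(which.max(table(train$Species)))
  rep(top, nrow(test))
}

acc <- repeated_split_accuracy(iris, "Species", majority_class)

# Mean, standard deviation, and a central 95% range of the accuracies
c(mean = mean(acc), sd = sd(acc), quantile(acc, c(0.025, 0.975)))
```

To use this with a real classifier, `fit_predict` would fit the model on `train` and return its predictions for `test`; the summary line then shows how much the accuracy moves from split to split, not just where it centers.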

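This accuracy is not very high: with seven days to choose from, guessing at random would be right about 14.3% of the time (1 in 7). So I wanted to explore my data a little more to see if I could find any patterns in my tea-drinking habits that the model would not be able to see.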

I do not drink equal amounts of tea on every day of the week: in fact, 17% of my beverages are drunk on Thursdays. Naive Bayes only sees this through the day-of-week proportions in whichever half of the data it is trained on, and those shift from split to split. I do work at the on-campus cafe on Thursdays, which could motivate higher beverage consumption, but I also decided to measure (with data from my Google Calendar) which days I tended to invite friends over for tea most. Lo and behold, 41% of my teatimes are on Thursdays! For my tea_fit model to predict with more accuracy, it would have to lean toward predicting Thursday the most often, then Tuesday, then Wednesday and Sunday, and so on.

prop.table(table(hotbev_filtered$day))*100
## 
##    Friday    Monday  Saturday    Sunday  Thursday   Tuesday Wednesday 
##  13.49693  13.49693  12.88344  14.41718  16.56442  14.72393  14.41718
#teatime by day! maya and dahlia's ideas :)
teatime <- data.frame("day"=c("Monday", "Tuesday", "Wednesday", "Thursday", "Tuesday", "Tuesday", "Wednesday", "Thursday", "Friday", "Monday", "Thursday", "Wednesday", "Thursday", "Thursday", "Friday", "Thursday", "Thursday"))

teatime|>
  mutate(day2=factor(day,levels=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")))|>
  ggplot(aes(x=day2))+
  geom_bar(fill="darkorange3")+
  labs(title="Phoebe's Teatimes by Day of the Week", x="day of the week", y="")

prop.table(table(teatime$day))*100
## 
##    Friday    Monday  Thursday   Tuesday Wednesday 
##  11.76471  11.76471  41.17647  17.64706  17.64706
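
To put the model's average accuracy in context, it can be compared with two simple baselines: guessing a day uniformly at random, and always guessing the most common day. This is a small sketch that hard-codes the day-of-week percentages printed by prop.table for hotbev_filtered$day above; the variable names are only for this example, and the majority figure describes the full dataset rather than any particular test split.

```r
# Day-of-week percentages copied from the prop.table output for hotbev_filtered$day
day_pct <- c(Friday = 13.49693, Monday = 13.49693, Saturday = 12.88344,
             Sunday = 14.41718, Thursday = 16.56442, Tuesday = 14.72393,
             Wednesday = 14.41718)

# Baseline 1: pick one of the seven days uniformly at random
chance_baseline <- 1 / length(day_pct)

# Baseline 2: always predict the most common day
majority_day <- names(which.max(day_pct))
majority_baseline <- max(day_pct) / 100

round(c(chance = chance_baseline, majority = majority_baseline), 3)
majority_day
```

Both baselines land close to the model's average accuracy, which suggests these predictors carry little information about the day of the week.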

There was also another variable I realized I hadn’t used: date! Obviously using that one is kind of cheating since, in theory, the model could realize that every 7th date falls on the same day of the week. But when I added date as a predictor to my model, the accuracy only rose to ~57%, which seems low to me. This could be attributed to the independence assumption of Naive Bayes. If, for instance, I drank a type of tea I usually drink on Wednesdays on a Monday, somewhere I usually am on Thursdays, the combination of these other factors could sway my model from predicting one day to another.

# Same repeated-split loop as before, now with date added as a predictor
set.seed(24)
B <- 1000
accuracy <- rep(NA, B)

for (b in 1:B) {
  training_data_rows <- sample(1:nrow(hotbev_filtered),
                               size = nrow(hotbev_filtered)/2)
  tea_training <- hotbev_filtered[training_data_rows, ]
  tea_testing <- hotbev_filtered[-training_data_rows, ]
  tea_fit <- naiveBayes(day ~ caffeine_content + variety + beverage + temp + date, data = tea_training)
  tea_preds <- predict(tea_fit, tea_testing)
  confusion_matrix <- table(tea_preds, tea_testing$day)
  accuracy[b] <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
}

avg_accuracy <- mean(accuracy)
avg_accuracy
## [1] 0.5690859
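
For comparison, here is a short sketch of why date is "cheating" in principle: the day of the week is an exact function of the date, but only through arithmetic (a remainder after dividing by 7) that Naive Bayes does not perform on its predictors. The dates below are hypothetical examples, since the format of my date column is not shown here, and the sketch assumes they can be parsed with as.Date.

```r
# Hypothetical example dates (the real date column's format is an assumption)
dates <- as.Date(c("2025-05-01", "2025-05-02", "2025-05-08", "2025-05-15"))

# R stores a Date as days since 1970-01-01, which was a Thursday, so the
# remainder mod 7 identifies the weekday: 0 = Thursday, 1 = Friday, ...
day_index <- as.integer(dates) %% 7L

# ISO weekday number (1 = Monday ... 7 = Sunday); does not depend on locale
iso_weekday <- as.integer(format(dates, "%u"))

data.frame(dates, day_index, iso_weekday)
```

Dates exactly 7 days apart share the same remainder, which is the "every 7th date" pattern. A model given only the raw date has to discover that pattern from examples instead of computing it.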

So I have learned that I drink the most tea on Thursdays and I have the most people over for tea on Thursdays!