My name is Phoebe and I drink a lot of tea! In 2025 I decided to track my hot beverage consumption. In this report, I am exploring the question of how my tea-drinking habits vary by day of the week, and whether there are any underlying patterns that my beverage preferences can be attributed to, using a Naive Bayes predictor.
I began by exploring my data to see whether there were any visible patterns: where I drank the most tea, and which types of tea I had most often. For the Type of Beverage and Location plots, I kept only categories with n > 1 to make the plots easier to read. I am a worldly woman and there are a lot of teas I have tried once but never again, so I removed those from my plots for visual purposes.
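A minimal sketch of that n > 1 filter in base R is shown here. It runs on a made-up vector (`toy_beverages` and the names in it are hypothetical, not the real log); the same idea would apply to the real beverage and location columns.

```r
# Hypothetical toy data: these names are illustrative, not the real log
toy_beverages <- c("Earl Grey", "Chai", "Earl Grey", "Sencha",
                   "Chai", "Earl Grey", "Lapsang Souchong")

# Count how often each beverage appears
beverage_counts <- table(toy_beverages)

# Keep only the beverages logged more than once (n > 1)
repeat_beverages <- names(beverage_counts[beverage_counts > 1])

# Drop the one-off entries before plotting
toy_filtered <- toy_beverages[toy_beverages %in% repeat_beverages]
table(toy_filtered)
```

Filtering on the counts rather than on hand-picked names means the plot stays tidy as new one-off teas get logged.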
I filtered my dataset before using the algorithm to make the predictions a little easier. I removed coffee, hot chocolate, and other beverages with variety “N/A” so that I could use variety as a predictor. I also made temperature, caffeine content, day, and month factors so they could be used as predictors.
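library(tidyverse) # dplyr's filter() here, ggplot2 for the plots later
library(e1071)     # assumed source of naiveBayes(), used below
hotbev_filtered <- hotbev |>
  filter(variety != "N/A")
#let's turn these columns into factors since all of my variables are qualitative!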
hotbev_filtered$variety <- as.factor(hotbev_filtered$variety)
hotbev_filtered$temp <- as.factor(hotbev_filtered$temp)
hotbev_filtered$beverage <- as.factor(hotbev_filtered$beverage)
hotbev_filtered$caffeine_content <- as.factor(hotbev_filtered$caffeine_content)
hotbev_filtered$month <- as.factor(hotbev_filtered$month)
hotbev_filtered$day <- as.factor(hotbev_filtered$day)
Naive Bayes
I decided to use a Naive Bayes algorithm to predict day of the week based on beverage, variety, temperature, and caffeine content. I chose not to use month because months vary in length and because I only collected data for 7 days in May.
Naive Bayes is a classification algorithm based on Bayes’ Theorem, a result from probability theory. A classification algorithm assigns each entry to one of a finite number of classes. Here, my classes are the days of the week, and I want to assign a day to each hot beverage I had. Naive Bayes rests on an unrealistic assumption: that the predictors are independent of one another once the class is known, so it never considers one variable together with another when making its predictions. Here, for instance, it treats beverage variety and caffeine content as totally independent of each other, even though the variety “Herbal” always has a caffeine content of “Decaf.”
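Written out, the rule Naive Bayes applies here looks roughly like this, where $x_1, \dots, x_4$ stand for the four predictor values of one beverage entry and $d$ for a candidate day (this notation is introduced here for illustration only):

```latex
P(\text{day} = d \mid x_1, \dots, x_4)
  \;\propto\;
  P(\text{day} = d) \prod_{j=1}^{4} P(x_j \mid \text{day} = d)
```

The product is where the independence assumption enters: each predictor contributes its own factor, and there is no term for how two predictors behave together. The predicted day is the $d$ that makes the right-hand side largest.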
While its assumption of independence often causes Naive Bayes to perform worse than random forests and other classification models, an advantage is that it needs only a small amount of data. Since I have been collecting my data manually, and I don’t drink one million cups of tea per day, my data set is small. Therefore, I needed a model that could make predictions from a small amount of data, which Naive Bayes is able to do. It is an algorithm that decreases variance but introduces bias, and works quite well in practice.
Methods
Once I decided to use Naive Bayes, I split my data in half at random. One half, the training data, is used to “train” the model, which learns the existing patterns between days of the week and caffeine content, variety, etcetera. The other half, the testing data, is then fed into the fitted model. I fit a Naive Bayes model called tea_fit on the training data, then stored its predictions for the testing data in a vector called tea_preds. I then compared these predictions to the actual days recorded in the testing data (tea_testing$day).
I was also able to measure the overall accuracy of my predictions, which was usually around 12-13%. However, each time I re-split the data and re-fit my model, the accuracy changed, ranging from 9% to 17%. To find a more consistent accuracy to give to my model, I decided to “bootstrap” my process (strictly speaking, this is repeated random splitting rather than a true bootstrap, which resamples with replacement). Instead of manually re-splitting my data into two parts and making predictions a million times, then recording the accuracies and averaging them out by hand, I made a loop that runs my original code 1000 times and stores the 1000 accuracies in a vector called “accuracy”.
I then averaged these accuracies to find out what a long-term accuracy score for my model would be. It ended up being around 14.4%, meaning that my model correctly predicts what day it is based on my beverage choice, temperature, variety, and caffeine content about 14.4% of the time. For comparison, guessing one of the seven days at random would be right 1/7 ≈ 14.3% of the time, so the model does barely better than chance.
training_data_rows <- sample(1:nrow(hotbev_filtered),
                             size = nrow(hotbev_filtered)/2)
tea_training <- hotbev_filtered[training_data_rows, ]
tea_testing <- hotbev_filtered[-training_data_rows, ]
#when i added date as a predictor accuracy increased to ~57% which is interesting because in theory shouldn't dates 7 days apart make day predictable??
tea_fit <- naiveBayes(day ~ caffeine_content + variety + beverage + temp, data = tea_training)
tea_preds <- predict(tea_fit, tea_testing)
confusion_matrix <- table(tea_preds, tea_testing$day)
print(confusion_matrix)
##
## tea_preds   Friday Monday Saturday Sunday Thursday Tuesday Wednesday
##   Friday         2      3        5      1        4       2         2
##   Monday         2      4        2      1        3       5         2
##   Saturday       2      2        0      2        2       1         0
##   Sunday         6      2        9      7        7       3         6
##   Thursday       5      1        5      7        5       6         5
##   Tuesday        6      5        5      2        4       5         8
##   Wednesday      2      2        0      0        1       1         1
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(paste("Accuracy:", round(accuracy, 2)))
## [1] "Accuracy: 0.15"
set.seed(24)
B <- 1000
accuracy <- rep(NA, B)
for (b in 1:B) {
  # re-split the data at random into training and testing halves
  training_data_rows <- sample(1:nrow(hotbev_filtered),
                               size = nrow(hotbev_filtered)/2)
  tea_training <- hotbev_filtered[training_data_rows, ]
  tea_testing <- hotbev_filtered[-training_data_rows, ]
  # fit on the training half, then predict the testing half
  tea_fit <- naiveBayes(day ~ caffeine_content + variety + beverage + temp, data = tea_training)
  tea_preds <- predict(tea_fit, tea_testing)
  confusion_matrix <- table(tea_preds, tea_testing$day)
  # share of testing rows where the predicted day matches the actual day
  accuracy[b] <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
}
avg_accuracy <- mean(accuracy)
avg_accuracy
## [1] 0.1443313
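To put that average in context, here is a small simulation of a guesser that knows nothing and picks one of the seven days at random. Everything in it is simulated (`true_days` and `guesses` are made up, not taken from the data); only `n_test = 163` comes from the confusion matrix above, and `guess_accuracy` is a name introduced for this sketch.

```r
set.seed(1)
days <- c("Monday", "Tuesday", "Wednesday", "Thursday",
          "Friday", "Saturday", "Sunday")

n_test <- 163   # number of testing rows in the confusion matrix above
B <- 1000       # same number of repetitions as the loop above

guess_accuracy <- rep(NA, B)
for (b in 1:B) {
  # simulated "true" days and purely random guesses (no real data involved)
  true_days <- sample(days, n_test, replace = TRUE)
  guesses   <- sample(days, n_test, replace = TRUE)
  guess_accuracy[b] <- mean(guesses == true_days)
}

mean(guess_accuracy)   # should land close to 1/7
1 / 7
```

If the model's average accuracy sits right on top of this number, the predictors are carrying very little information about the day.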
This accuracy is not very high, so I wanted to explore my data a little more to see if I could find any patterns in my tea-drinking habits that the model would not be able to see.
I do not drink equal amounts of tea on every day of the week: about 17% of my beverages are drunk on Thursdays. Naive Bayes does take this into account through its class priors, the share of each day in the training data, but with every day sitting between about 13% and 17%, the priors barely favor one day over another. I do work at the on-campus cafe on Thursdays, which could motivate higher beverage consumption, but I also decided to measure (with data from my Google Calendar) which days I tended to invite friends over for tea most. Lo and behold, 41% of my teatimes are on Thursdays! If tea_fit relied on these frequencies alone, it would predict Thursday the most often, then Tuesday, then Wednesday and Sunday, and so on.
prop.table(table(hotbev_filtered$day))*100
##
##    Friday    Monday  Saturday    Sunday  Thursday   Tuesday Wednesday
##  13.49693  13.49693  12.88344  14.41718  16.56442  14.72393  14.41718
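One simple benchmark for these proportions is a rule that always guesses the most common day. The sketch below is only an illustration: the counts are reconstructed from the percentages above under the assumption that the filtered data has 326 rows (twice the 163 testing rows in the confusion matrix), and `day_counts`, `majority_day`, and `majority_accuracy` are names introduced here.

```r
# Day counts inferred from the percentages above, assuming 326 rows in total.
# These are reconstructed, not read directly from the data.
day_counts <- c(Friday = 44, Monday = 44, Saturday = 42, Sunday = 47,
                Thursday = 54, Tuesday = 48, Wednesday = 47)

# The "always guess the most common day" rule
majority_day <- names(which.max(day_counts))
majority_accuracy <- max(day_counts) / sum(day_counts)

majority_day
round(100 * majority_accuracy, 2)
```

Under these assumed counts, always guessing Thursday would be right about 16.6% of the time on the full data set, a little above the 14.4% that the Naive Bayes model averaged.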
#teatime by day! maya and dahlia's ideas :)
teatime <- data.frame("day"=c("Monday", "Tuesday", "Wednesday", "Thursday", "Tuesday", "Tuesday", "Wednesday", "Thursday", "Friday", "Monday", "Thursday", "Wednesday", "Thursday", "Thursday", "Friday", "Thursday", "Thursday"))
teatime |>
  mutate(day2 = factor(day, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday"))) |>
  ggplot(aes(x = day2)) +
  geom_bar(fill = "darkorange3") +
  labs(title = "Phoebe's Teatimes by Day of the Week", x = "day of the week", y = "")
prop.table(table(teatime$day))*100
##
##    Friday    Monday  Thursday   Tuesday Wednesday
##  11.76471  11.76471  41.17647  17.64706  17.64706
There was also another variable I realized I hadn’t used: date! Obviously using that one is kind of cheating since, in theory, the model could work out that dates seven days apart fall on the same day of the week. But when I added date as a predictor to my model, the accuracy only rose to ~57%, which seems low to me. This could be attributed to the independence assumption of Naive Bayes. If, for instance, I drank a type of tea I usually drink on Wednesdays, on a Monday, somewhere I usually am on Thursdays, the combination of these other factors could sway my model from predicting one day to another. It may also be that Naive Bayes has no built-in notion of a seven-day cycle, so a date only helps when the model has already seen that same date in the training half.
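The reason date feels like cheating is that the day of the week is a fixed function of the date, which base R can compute with no model at all. A small sketch, using made-up example dates (the format of the real date column is not shown in this report, so calling `as.Date()` on it is an assumption):

```r
# Hypothetical dates for illustration: four dates spaced seven days apart
dates <- as.Date("2025-05-01") + 7 * (0:3)

# "%u" gives the ISO weekday number (1 = Monday, ..., 7 = Sunday),
# which does not depend on the computer's language settings
weekday_number <- as.integer(format(dates, "%u"))
weekday_number

# Dates seven days apart always share a weekday
length(unique(weekday_number)) == 1
```

A predictor built this way would make day perfectly predictable, which is exactly why it would not say anything about tea habits.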
set.seed(24)
B <- 1000
accuracy <- rep(NA, B)
for (b in 1:B) {
  training_data_rows <- sample(1:nrow(hotbev_filtered),
                               size = nrow(hotbev_filtered)/2)
  tea_training <- hotbev_filtered[training_data_rows, ]
  tea_testing <- hotbev_filtered[-training_data_rows, ]
  # same model as before, with date added as a predictor
  tea_fit <- naiveBayes(day ~ caffeine_content + variety + beverage + temp + date, data = tea_training)
  tea_preds <- predict(tea_fit, tea_testing)
  confusion_matrix <- table(tea_preds, tea_testing$day)
  accuracy[b] <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
}
avg_accuracy <- mean(accuracy)
avg_accuracy
## [1] 0.5690859
So I have learned that I drink the most tea on Thursdays and I have the most people over for tea on Thursdays!