Portfolio Overview

This portfolio highlights my work in POSC497: Categorical Data Visualization and Data Visualization during Spring 2026. I chose Lab 3, Lab 5, Lab 10, and Lab 11.

I chose these labs because they stood out both as showcases of my growth and as personal favorites.

Lab 3: Exploring the Titanic Dataset with dplyr

In this lab, I used dplyr functions to compare death rates by sex and age group to examine whether the “women and children first” rule was reflected in passenger survival patterns.

Death rates by sex and age group in the Titanic dataset
Sex     age_group  passengers  deaths  death_rate
male    Adult      395         325     0.823
male    Child      58          35      0.603
female  Child      55          17      0.309
female  Adult      206         47      0.228

The table shows that adult men had the highest death rate (0.823), while adult women had the lowest (0.228). This supports the idea that women were prioritized when it came to survival. However, the results for children complicate the “women and children first” narrative: male children died at nearly twice the rate of female children (0.603 versus 0.309). If the rule had been applied consistently, sex would not have mattered among children. This suggests that the rule was not applied evenly across all groups.
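
Each death rate in the table is the number of deaths divided by the number of passengers in that sex and age group. As a quick check, the minimal sketch below recomputes the rates from the counts in the table. It is written in Python as a stand-in, not the dplyr code used in the lab, and the variable names are illustrative.

```python
# Counts copied from the table: (sex, age_group) -> (passengers, deaths)
counts = {
    ("male", "Adult"): (395, 325),
    ("male", "Child"): (58, 35),
    ("female", "Child"): (55, 17),
    ("female", "Adult"): (206, 47),
}

# Death rate = deaths / passengers within each group
death_rates = {
    group: deaths / passengers
    for group, (passengers, deaths) in counts.items()
}

# Sort from highest to lowest death rate, matching the table's row order
ranked = sorted(death_rates.items(), key=lambda item: item[1], reverse=True)

for (sex, age_group), rate in ranked:
    print(f"{sex:<6} {age_group:<5} {rate:.3f}")
```

Rounded to three decimals, the recomputed rates match the death_rate column.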

Lab 5: Comparing Support for President Obama by Race

In this collaborative class lab, we analyzed the Youth Participatory Politics Survey Project data to compare average support for President Obama across racial groups. The red points represent each group’s mean support, while the blue error bars show uncertainty around those means.
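
To make the points and error bars concrete, here is a minimal Python sketch of how a group mean and an interval around it can be computed. It assumes the error bars are 95% intervals built from the standard error of the mean, which the lab description does not confirm. The group names and scores are hypothetical and are not taken from the survey data.

```python
import math
import statistics

# Hypothetical support scores for two made-up groups (NOT the survey data)
scores = {
    "Group A": [4, 5, 3, 4, 5, 4],
    "Group B": [2, 3, 3, 1, 2, 4],
}

def mean_with_interval(values, z=1.96):
    """Return (mean, lower, upper) using a normal-approximation 95% interval."""
    mean = statistics.mean(values)
    # Standard error of the mean: sample standard deviation / sqrt(n)
    se = statistics.stdev(values) / math.sqrt(len(values))
    return mean, mean - z * se, mean + z * se

for group, values in scores.items():
    mean, lower, upper = mean_with_interval(values)
    # The mean is the plotted point; lower and upper are the ends of the error bar
    print(f"{group}: mean={mean:.2f}, interval=({lower:.2f}, {upper:.2f})")
```

When two groups' intervals overlap heavily, the difference between their means is harder to distinguish from sampling noise.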

The image below shows the class’s first interpretation of what the plot might look like.

Lab 10: Interactive East Coast Election Map

In this lab, I created an interactive choropleth map showing the Democratic vote share across East Coast states in the 2020 presidential election. The map uses darker shades of blue to represent higher Democratic vote share. Clicking on a state reveals its Democratic and Republican vote percentages and vote counts.

The map shows that Democratic vote share varied across the East Coast. Darker states had a higher Democratic vote share, while lighter states had a lower one. The interactive popups make the visualization more useful because readers can inspect both percentages and raw vote counts for each state.

Lab 11: Machine Learning and Titanic Survival

In this lab, I used machine learning models to predict Titanic survival from passenger characteristics such as sex, passenger class, age, family size, and fare. I compared a Generalized Linear Model (GLM), a Random Forest, and AutoML models to see which approach predicted survival most effectively.

The GLM plots show that sex and passenger class were the strongest predictors of survival. As we found in Lab 3, female passengers were associated with higher predicted survival, while male passengers were associated with lower predicted survival. Unlike the Lab 3 summary table, however, the ML models could also take factors like fare and class into account. Age and fare also contributed to the model, but they were weaker predictors than sex and class.

The Partial Dependence Plot (PDP) for sex shows a much higher predicted survival response for female passengers than for male passengers. The passenger class plot likewise shows higher predicted survival for first-class passengers than for the lower classes. The fare plot tells us that predicted survival generally increases as fare rises, though the effect of fare appears less direct than that of sex or passenger class because fare partly overlaps with class.
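
As a rough illustration of what a partial dependence plot computes, the Python sketch below forces one feature to each of its values for every passenger and averages the model's predictions. The scoring function and the passengers are hypothetical stand-ins, not the fitted models or the data from the lab.

```python
# Hypothetical toy model: NOT the lab's fitted model, just a stand-in scorer
def toy_predict(passenger):
    score = 0.2
    if passenger["sex"] == "female":
        score += 0.5
    if passenger["pclass"] == 1:
        score += 0.2
    return min(score, 1.0)

# Made-up passengers for illustration only
passengers = [
    {"sex": "male", "pclass": 3},
    {"sex": "female", "pclass": 1},
    {"sex": "male", "pclass": 1},
    {"sex": "female", "pclass": 3},
]

def partial_dependence(predict, rows, feature, values):
    """Average prediction when `feature` is forced to each value for every row."""
    result = {}
    for value in values:
        # Copy each row, overriding only the feature of interest
        modified = [{**row, feature: value} for row in rows]
        result[value] = sum(predict(row) for row in modified) / len(modified)
    return result

pdp_sex = partial_dependence(toy_predict, passengers, "sex", ["male", "female"])
print(pdp_sex)
```

Because every other feature keeps its observed value, the gap between the two averages reflects the model's response to sex alone.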

Top AutoML models for Titanic survival prediction
model_id                                                 auc     logloss  aucpr   mean_per_class_error  rmse    mse
StackedEnsemble_BestOfFamily_1_AutoML_1_20260515_155945  0.8705  0.4114   0.8433  0.1800                0.3563  0.1269
GBM_4_AutoML_1_20260515_155945                           0.8696  0.4135   0.8424  0.1788                0.3571  0.1275
StackedEnsemble_AllModels_1_AutoML_1_20260515_155945     0.8655  0.4152   0.8416  0.1821                0.3575  0.1278
GBM_grid_1_AutoML_1_20260515_155945_model_1              0.8615  0.4329   0.8355  0.2026                0.3682  0.1356
GBM_2_AutoML_1_20260515_155945                           0.8610  0.4213   0.8328  0.1904                0.3609  0.1303
GBM_3_AutoML_1_20260515_155945                           0.8606  0.4223   0.8377  0.1942                0.3621  0.1311
GBM_grid_1_AutoML_1_20260515_155945_model_3              0.8601  0.4293   0.8305  0.1974                0.3672  0.1348
GBM_5_AutoML_1_20260515_155945                           0.8574  0.4172   0.8393  0.1877                0.3582  0.1283
DeepLearning_grid_2_AutoML_1_20260515_155945_model_1     0.8568  0.4615   0.8245  0.1939                0.3739  0.1398
GBM_grid_1_AutoML_1_20260515_155945_model_2              0.8562  0.4181   0.8305  0.1911                0.3594  0.1292

The GLM achieved an AUC of approximately 0.86, while the Random Forest achieved an AUC of approximately 0.88. Since AUC measures how well a model separates passengers who survived from passengers who did not, I’d say that the Random Forest performed better of the two.
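
One way to read AUC is as the share of (survivor, non-survivor) pairs in which the model gives the survivor the higher predicted probability. The Python sketch below computes it that way on hypothetical labels and scores. These values are made up for illustration and are not output from the lab's models.

```python
def auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical labels (1 = survived) and predicted probabilities, not lab output
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
print(auc(labels, scores))
```

An AUC of 0.5 corresponds to ranking pairs no better than chance, and 1.0 means every survivor is scored above every non-survivor.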

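The AutoML leaderboard also showed strong performance from ensemble and tree-based models: a stacked ensemble ranked first with an AUC of 0.8705, and gradient boosting machines (GBMs) filled most of the remaining top ten.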

Overall Reflection

Lab 3 emphasized data wrangling and summary tables, Lab 5 focused on statistical visualization, Lab 10 introduced interactive mapping, and Lab 11 applied machine learning to prediction and interpretation.

All in all, these labs highlight my growing skills in using R for data visualization and statistical analysis.