This portfolio highlights my work in POSC497: Categorical Data Visualization and Data Visualization in Spring 2026. I chose Lab 3, Lab 5, Lab 10, and Lab 11.
I chose these labs because they best showcase my growth and are also my personal favorites.
In Lab 3, I used dplyr functions to compare Titanic death rates by sex and age group and examine whether the “women and children first” rule was reflected in passenger survival patterns.
| Sex | age_group | passengers | deaths | death_rate |
|---|---|---|---|---|
| male | Adult | 395 | 325 | 0.823 |
| male | Child | 58 | 35 | 0.603 |
| female | Child | 55 | 17 | 0.309 |
| female | Adult | 206 | 47 | 0.228 |
The table shows that adult men had the highest death rate (82.3%), while adult women had the lowest (22.8%). This supports the idea that women were prioritized when it came to survival. However, the results for children complicate the “women and children first” narrative, because male children died at a much higher rate than female children (60.3% versus 30.9%). If the rule had been applied consistently, sex would not have mattered for children. This suggests that the rule was not applied evenly across all groups.
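As a minimal sketch (not the lab's original code), the `death_rate` column can be recomputed from the counts in the table. The object names `group_counts`, `death_rates`, and `boy_girl_ratio` are assumptions made for illustration; the counts themselves are copied from the table above.

```r
library(dplyr)

# Counts copied from the summary table above (one row per sex and age group).
group_counts <- data.frame(
  Sex        = c("male", "male", "female", "female"),
  age_group  = c("Adult", "Child", "Child", "Adult"),
  passengers = c(395, 58, 55, 206),
  deaths     = c(325, 35, 17, 47)
)

# death_rate is deaths divided by passengers within each group,
# sorted from the highest rate to the lowest.
death_rates <- group_counts %>%
  mutate(death_rate = round(deaths / passengers, 3)) %>%
  arrange(desc(death_rate))

# Ratio of the male child death rate to the female child death rate.
child_rates <- death_rates %>% filter(age_group == "Child")
boy_girl_ratio <- child_rates$death_rate[child_rates$Sex == "male"] /
  child_rates$death_rate[child_rates$Sex == "female"]

print(death_rates)
print(round(boy_girl_ratio, 2))
```

The ratio makes the gap concrete: male children died at roughly twice the rate of female children.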
In Lab 5, a collaborative class lab, we analyzed the Youth Participatory Politics Survey Project data to compare average support for President Obama across racial groups. The red points represent each group’s mean support, while the blue error bars show uncertainty around those means.
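A point with an error bar like the ones described above could be computed along these lines. This is only a sketch: I am assuming a 95% t-based confidence interval, which may not be the interval the class plot used, and `mean_with_ci` and `toy_scores` are hypothetical names with made-up values, not survey data.

```r
# Hypothetical helper: mean of a numeric vector with a t-based confidence interval.
mean_with_ci <- function(x, level = 0.95) {
  x <- x[!is.na(x)]                      # drop missing responses
  n <- length(x)
  m <- mean(x)
  se <- sd(x) / sqrt(n)                  # standard error of the mean
  t_crit <- qt(1 - (1 - level) / 2, df = n - 1)
  c(mean = m, lower = m - t_crit * se, upper = m + t_crit * se)
}

# Toy values for illustration only (NOT taken from the survey).
toy_scores <- c(2, 4, 4, 5, 5)
result <- mean_with_ci(toy_scores)
print(result)
```

Applied to each racial group separately, the `mean` value would give the point and `lower` and `upper` would give the ends of the error bar.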
The image below shows the class’s first interpretation of what the plot might look like.
In Lab 10, I created an interactive choropleth map showing Democratic vote share across East Coast states in the 2020 presidential election. The map uses darker shades of blue to represent higher Democratic vote share. Clicking on a state reveals its Democratic and Republican vote percentages and vote counts.
The map shows that Democratic vote share varied across the East Coast, with darker states voting more heavily Democratic than lighter ones. The interactive popups make the visualization more useful because readers can inspect both percentages and raw vote counts for each state.
In Lab 11, I used machine learning models to predict Titanic survival from passenger characteristics such as sex, passenger class, age, family size, and fare. I compared a Generalized Linear Model (GLM), a Random Forest, and AutoML models to evaluate which approach predicted survival most effectively.
The GLM plots show that sex and passenger class were the strongest predictors of survival. As we found in Lab 3, female passengers were associated with higher predicted survival, while male passengers were associated with lower predicted survival. The ML models, however, could also account for factors like fare and class. Age and fare contributed to the model as well, but they were weaker predictors than sex and class.
The sex Partial Dependence Plot (PDP) shows a much higher predicted survival response for female passengers than for male passengers. The passenger class plot also shows higher predicted survival for first-class passengers than for the lower classes. The fare plot tells us that predicted survival generally increases as fare rises, though fare is a less direct predictor than sex or passenger class because it partly overlaps with class.
| model_id | auc | logloss | aucpr | mean_per_class_error | rmse | mse |
|---|---|---|---|---|---|---|
| StackedEnsemble_BestOfFamily_1_AutoML_1_20260515_155945 | 0.8705 | 0.4114 | 0.8433 | 0.1800 | 0.3563 | 0.1269 |
| GBM_4_AutoML_1_20260515_155945 | 0.8696 | 0.4135 | 0.8424 | 0.1788 | 0.3571 | 0.1275 |
| StackedEnsemble_AllModels_1_AutoML_1_20260515_155945 | 0.8655 | 0.4152 | 0.8416 | 0.1821 | 0.3575 | 0.1278 |
| GBM_grid_1_AutoML_1_20260515_155945_model_1 | 0.8615 | 0.4329 | 0.8355 | 0.2026 | 0.3682 | 0.1356 |
| GBM_2_AutoML_1_20260515_155945 | 0.8610 | 0.4213 | 0.8328 | 0.1904 | 0.3609 | 0.1303 |
| GBM_3_AutoML_1_20260515_155945 | 0.8606 | 0.4223 | 0.8377 | 0.1942 | 0.3621 | 0.1311 |
| GBM_grid_1_AutoML_1_20260515_155945_model_3 | 0.8601 | 0.4293 | 0.8305 | 0.1974 | 0.3672 | 0.1348 |
| GBM_5_AutoML_1_20260515_155945 | 0.8574 | 0.4172 | 0.8393 | 0.1877 | 0.3582 | 0.1283 |
| DeepLearning_grid_2_AutoML_1_20260515_155945_model_1 | 0.8568 | 0.4615 | 0.8245 | 0.1939 | 0.3739 | 0.1398 |
| GBM_grid_1_AutoML_1_20260515_155945_model_2 | 0.8562 | 0.4181 | 0.8305 | 0.1911 | 0.3594 | 0.1292 |
The GLM achieved an AUC of approximately 0.86, while the Random Forest achieved an AUC of approximately 0.88. Since AUC measures how well a model separates passengers who survived from passengers who did not, I would say that the Random Forest performed better overall.
The AutoML leaderboard was also led by ensemble and tree-based models: a stacked ensemble ranked first with an AUC of 0.8705, followed closely by a GBM at 0.8696.
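To make the AUC comparison concrete, here is a small sketch of what the metric measures: the share of survivor and non-survivor pairs in which the survivor receives the higher predicted score. This is a base R illustration, not how the lab's models computed AUC; `auc_by_pairs` is a hypothetical helper, and the labels and scores are toy values rather than Titanic predictions.

```r
# Hypothetical helper: AUC as the fraction of (positive, negative) pairs
# in which the positive case gets the higher score; ties count as half.
auc_by_pairs <- function(labels, scores) {
  pos <- scores[labels == 1]
  neg <- scores[labels == 0]
  wins <- outer(pos, neg, function(p, n) (p > n) + 0.5 * (p == n))
  mean(wins)
}

# Toy values for illustration only (NOT model output from the lab).
toy_labels <- c(1, 1, 0, 1, 0, 0)        # 1 = survived, 0 = did not survive
toy_scores <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.2)

toy_auc <- auc_by_pairs(toy_labels, toy_scores)
print(toy_auc)
```

Here 8 of the 9 survivor and non-survivor pairs are ranked correctly, so the toy AUC is 8/9. An AUC of 0.5 would mean the scores rank pairs no better than chance, and 1.0 would mean every pair is ranked correctly.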
Lab 3 emphasized data wrangling and summary tables, Lab 5 focused on statistical visualization, Lab 10 introduced interactive mapping, and Lab 11 applied machine learning to prediction and interpretation.
All in all, these labs highlight my growing skill in using R for data visualization and statistical analysis.