Portfolio Overview

This portfolio highlights selected work from my Spring 2026 semester in POSC497: Categorical Data Visualization and Data Visualization. Across these labs, I used R to clean data, summarize patterns, build visualizations, create interactive maps, and apply machine learning models.

I chose these labs to show my improvement across the course, moving from data wrangling and exploratory analysis to visualization, mapping, and predictive modeling.

Lab 3: Exploring the Titanic Dataset with dplyr

In this lab, I used the Titanic dataset to examine whether the “women and children first” rule was reflected in passenger survival patterns. I used dplyr functions such as select(), mutate(), filter(), group_by(), and summarize() to compare death rates by sex and age group.
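
The dplyr steps named above might fit together roughly as in the sketch below. It runs on a small made-up data frame rather than the lab's actual data, and the column names (Sex, Age, Survived) and the under-18 cutoff for "Child" are assumptions made for illustration.

```r
library(dplyr)

# Toy stand-in for the Titanic data. Column names and values are
# illustrative assumptions, not the lab's actual dataset.
titanic_toy <- tibble(
  Sex      = c("male", "male", "male", "female", "female", "female", "male", "female"),
  Age      = c(34, 8, 52, 29, 6, 41, 15, 12),
  Survived = c(0, 0, 0, 1, 1, 1, 1, 0)
)

death_rates <- titanic_toy %>%
  select(Sex, Age, Survived) %>%                  # keep only the needed columns
  filter(!is.na(Age)) %>%                         # drop passengers with unknown age
  mutate(age_group = if_else(Age < 18, "Child", "Adult")) %>%  # assumed cutoff
  group_by(Sex, age_group) %>%
  summarize(
    passengers = n(),
    deaths     = sum(Survived == 0),
    death_rate = deaths / passengers,
    .groups    = "drop"
  ) %>%
  arrange(desc(death_rate))                       # highest death rate first

print(death_rates)
```

Sorting by death_rate puts the most affected group at the top, which is the same ordering used in the table that follows.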

Death rates by sex and age group in the Titanic dataset
Sex     age_group  passengers  deaths  death_rate
male    Adult             395     325       0.823
male    Child              58      35       0.603
female  Child              55      17       0.309
female  Adult             206      47       0.228

The table shows that adult men had the highest death rate (0.823), while adult women had the lowest (0.228). This supports the idea that sex strongly shaped survival outcomes. However, the results for children complicate the “women and children first” narrative: male children died at nearly twice the rate of female children (0.603 versus 0.309). If the rule had been applied consistently, sex would not have mattered among children. This suggests that the rule was not applied evenly across all groups.

Lab 5: Comparing Support for President Obama by Race

In this collaborative class lab, we analyzed the Youth Participatory Politics Survey Project data to compare average support for President Obama across racial groups. The visualization below shows group means with 95% confidence intervals.

The red points represent each group’s mean support, and the blue error bars show the 95% confidence interval around each mean. The horizontal black line marks the overall sample mean, which makes it easier to see which groups fall above or below the average.
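
A plot of this kind could be built along the lines of the sketch below. The data are simulated, and the group labels, the obama_support variable name, and the use of a t-based interval are all assumptions made for illustration, not the survey's actual variables or the class's exact method.

```r
library(dplyr)
library(ggplot2)

# Simulated stand-in for the survey data (hypothetical names and values).
set.seed(1)
survey_toy <- tibble(
  race          = rep(c("Group A", "Group B", "Group C"), each = 50),
  obama_support = c(rnorm(50, 3.0, 1), rnorm(50, 3.8, 1), rnorm(50, 3.4, 1))
)

# Group means with 95% confidence intervals (t-based; an assumption).
group_means <- survey_toy %>%
  group_by(race) %>%
  summarize(
    n_obs        = n(),
    mean_support = mean(obama_support),
    se           = sd(obama_support) / sqrt(n_obs),
    .groups      = "drop"
  ) %>%
  mutate(
    ci_low  = mean_support - qt(0.975, n_obs - 1) * se,
    ci_high = mean_support + qt(0.975, n_obs - 1) * se
  )

overall_mean <- mean(survey_toy$obama_support)

# Red points for means, blue error bars for intervals, black line for the
# overall sample mean.
support_plot <- ggplot(group_means, aes(x = race, y = mean_support)) +
  geom_hline(yintercept = overall_mean, color = "black") +
  geom_errorbar(aes(ymin = ci_low, ymax = ci_high), color = "blue", width = 0.2) +
  geom_point(color = "red", size = 3) +
  labs(x = "Racial group", y = "Mean support for President Obama")

print(support_plot)
```

Drawing the reference line first keeps it behind the points and error bars, so the group estimates stay visually on top.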

Lab 10: Interactive East Coast Election Map

In this lab, I created an interactive choropleth map showing Democratic vote share across East Coast states in the 2020 presidential election. The map uses darker shades of blue to represent higher Democratic vote share. Clicking on a state reveals its Democratic and Republican vote percentages and vote counts.

The map shows that Democratic vote share varied across the East Coast, with darker states voting more heavily Democratic than lighter ones. The interactive popups make the visualization more useful because readers can inspect both percentages and raw vote counts for each state.
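
One possible way to build a map like this is sketched below with the leaflet and maps packages. The vote numbers are placeholders and NOT real 2020 results, the three states are chosen only to keep the example short, and computing percentages from the two-party total is an assumption. The lab itself may have used different packages or boundary files.

```r
library(leaflet)
library(maps)

# Placeholder values for illustration only. These are NOT real election results.
votes_toy <- data.frame(
  state     = c("new york", "new jersey", "connecticut"),
  dem_votes = c(600L, 550L, 580L),
  rep_votes = c(400L, 450L, 420L)
)
two_party <- votes_toy$dem_votes + votes_toy$rep_votes
votes_toy$dem_pct <- 100 * votes_toy$dem_votes / two_party
votes_toy$rep_pct <- 100 * votes_toy$rep_votes / two_party

# State outlines from the maps package. Some states come back as several
# polygons named like "new york:manhattan", so strip the suffix to match rows.
state_shapes    <- map("state", regions = votes_toy$state, fill = TRUE, plot = FALSE)
polygon_state   <- sub(":.*", "", state_shapes$names)
row_for_polygon <- match(polygon_state, votes_toy$state)

# Darker blue for higher Democratic vote share.
blue_scale <- colorNumeric("Blues", domain = votes_toy$dem_pct)

# Popup text with percentages and raw counts for each state.
popup_text <- sprintf(
  "<b>%s</b><br>Democratic: %.1f%% (%d votes)<br>Republican: %.1f%% (%d votes)",
  votes_toy$state,
  votes_toy$dem_pct, votes_toy$dem_votes,
  votes_toy$rep_pct, votes_toy$rep_votes
)

election_map <- leaflet(state_shapes) %>%
  addTiles() %>%
  addPolygons(
    fillColor   = blue_scale(votes_toy$dem_pct[row_for_polygon]),
    fillOpacity = 0.8,
    color       = "white",
    weight      = 1,
    popup       = popup_text[row_for_polygon]
  )

election_map
```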

Lab 11: Machine Learning and Titanic Survival

In this lab, I used machine learning models to predict Titanic survival using passenger characteristics such as sex, passenger class, age, family size, and fare. I compared a Generalized Linear Model, Random Forest, and AutoML models to evaluate which approach predicted survival most effectively.

The images below are included as saved plot files.

The GLM plots show that sex and passenger class were the strongest predictors of survival. Female passengers and first-class passengers were associated with higher predicted survival, while male passengers and third-class passengers were associated with lower predicted survival. Age and fare also contributed to the model, but they were weaker predictors than sex and class.

The Partial Dependence Plots show how predicted survival changes as one predictor changes while the others are averaged out. The sex plot shows a much higher predicted survival response for female passengers than for male passengers. The passenger class plot shows higher predicted survival for first-class passengers compared to lower classes. The fare plot suggests that predicted survival generally increases as fare rises, though fare appears less direct than sex or passenger class because it overlaps with class.
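
The idea behind a partial dependence value can be sketched by hand: set one predictor to a fixed value for every row, predict, and average. The example below does this in base R with a small simulated dataset and a logistic regression. The partial_dependence() helper, the variable names, and the simulated effect sizes are hypothetical and are not the lab's actual model or plotting code.

```r
# Simulated stand-in data (hypothetical names and effect sizes).
set.seed(7)
n <- 300
toy <- data.frame(
  sex  = factor(sample(c("female", "male"), n, replace = TRUE)),
  fare = round(runif(n, 5, 100), 1)
)
toy$survived <- rbinom(n, 1, plogis(0.5 - 2 * (toy$sex == "male") + 0.02 * toy$fare))

fit <- glm(survived ~ sex + fare, data = toy, family = binomial())

# Hypothetical helper: average prediction when `feature` is forced to each
# value in `values` while all other columns keep their observed values.
partial_dependence <- function(model, data, feature, values) {
  sapply(values, function(v) {
    modified <- data
    if (is.factor(data[[feature]])) {
      modified[[feature]] <- factor(rep(v, nrow(data)), levels = levels(data[[feature]]))
    } else {
      modified[[feature]] <- rep(v, nrow(data))
    }
    mean(predict(model, newdata = modified, type = "response"))
  })
}

pd_sex  <- partial_dependence(fit, toy, "sex", c("female", "male"))
pd_fare <- partial_dependence(fit, toy, "fare", seq(10, 100, by = 30))

print(pd_sex)
print(pd_fare)
```

Plotting pd_fare against the fare values would give a curve comparable in spirit to the fare plot described above.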

Top AutoML models for Titanic survival prediction
model_id                                                    auc  logloss   aucpr  mean_per_class_error    rmse     mse
StackedEnsemble_BestOfFamily_1_AutoML_1_20260515_155945  0.8705   0.4114  0.8433                0.1800  0.3563  0.1269
GBM_4_AutoML_1_20260515_155945                           0.8696   0.4135  0.8424                0.1788  0.3571  0.1275
StackedEnsemble_AllModels_1_AutoML_1_20260515_155945     0.8655   0.4152  0.8416                0.1821  0.3575  0.1278
GBM_grid_1_AutoML_1_20260515_155945_model_1              0.8615   0.4329  0.8355                0.2026  0.3682  0.1356
GBM_2_AutoML_1_20260515_155945                           0.8610   0.4213  0.8328                0.1904  0.3609  0.1303
GBM_3_AutoML_1_20260515_155945                           0.8606   0.4223  0.8377                0.1942  0.3621  0.1311
GBM_grid_1_AutoML_1_20260515_155945_model_3              0.8601   0.4293  0.8305                0.1974  0.3672  0.1348
GBM_5_AutoML_1_20260515_155945                           0.8574   0.4172  0.8393                0.1877  0.3582  0.1283
DeepLearning_grid_2_AutoML_1_20260515_155945_model_1     0.8568   0.4615  0.8245                0.1939  0.3739  0.1398
GBM_grid_1_AutoML_1_20260515_155945_model_2              0.8562   0.4181  0.8305                0.1911  0.3594  0.1292

The GLM achieved an AUC of approximately 0.860, while the Random Forest achieved an AUC of approximately 0.892. Since AUC measures how well a model separates passengers who survived from passengers who did not survive, the Random Forest performed better overall.
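
To make the AUC comparison concrete, here is a minimal base R sketch that fits a logistic GLM on simulated data and computes AUC from ranks, which equals the share of survivor and non-survivor pairs that the model orders correctly. The data, the variable names, and the auc_rank() helper are hypothetical. Base R stands in here for whatever modeling toolkit the lab used, and the result will not match the AUC values reported above.

```r
# Simulated passengers (hypothetical names and effect sizes, not lab data).
set.seed(42)
n <- 400
passengers_toy <- data.frame(
  sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  pclass = factor(sample(1:3, n, replace = TRUE)),
  age    = round(runif(n, 1, 70))
)
log_odds <- 1.5 -
  2.5 * (passengers_toy$sex == "male") -
  0.8 * (as.integer(passengers_toy$pclass) - 1) -
  0.01 * passengers_toy$age
passengers_toy$survived <- rbinom(n, 1, plogis(log_odds))

# Hold out 100 rows so AUC is measured on data the model has not seen.
train_rows <- sample(n, 300)
train_set  <- passengers_toy[train_rows, ]
test_set   <- passengers_toy[-train_rows, ]

glm_fit   <- glm(survived ~ sex + pclass + age, data = train_set, family = binomial())
pred_prob <- predict(glm_fit, newdata = test_set, type = "response")

# Hypothetical helper: rank-based AUC (Mann-Whitney formulation).
# 1 means perfect separation and 0.5 means no better than chance.
auc_rank <- function(scores, labels) {
  is_pos <- labels == 1
  n_pos  <- sum(is_pos)
  n_neg  <- sum(!is_pos)
  ranks  <- rank(scores)  # tied scores receive average ranks
  (sum(ranks[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

test_auc <- auc_rank(pred_prob, test_set$survived)
print(test_auc)
```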

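The AutoML leaderboard also showed strong performance from ensemble and tree-based models: a stacked ensemble ranked first with an AUC of 0.8705, and gradient boosting machines (GBMs) took seven of the top ten spots. However, the GLM remains valuable because its coefficient plots make the direction of each predictor’s relationship with survival easier to explain.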

Overall Reflection

Across these labs, I practiced moving from raw data to meaningful analysis. Lab 3 emphasized data wrangling and summary tables, Lab 5 focused on statistical visualization, Lab 10 introduced interactive mapping, and Lab 11 applied machine learning to prediction and interpretation. Together, these projects show how R can be used not only to produce graphs and models, but also to tell clearer stories with data.