This portfolio highlights selected work from my Spring 2026 semester in POSC497: Categorical Data Visualization and Data Visualization. Across these labs, I used R to clean data, summarize patterns, build visualizations, create interactive maps, and apply machine learning models.
I chose these labs to show my improvement across the course, moving from data wrangling and exploratory analysis to visualization, mapping, and predictive modeling.
In Lab 3, I used the Titanic dataset to examine whether the “women and children first” rule was reflected in passenger survival patterns. I used dplyr functions such as select(), mutate(), filter(), group_by(), and summarize() to compare death rates by sex and age group.
| Sex | age_group | passengers | deaths | death_rate |
|---|---|---|---|---|
| male | Adult | 395 | 325 | 0.823 |
| male | Child | 58 | 35 | 0.603 |
| female | Child | 55 | 17 | 0.309 |
| female | Adult | 206 | 47 | 0.228 |
The table shows that adult men had the highest death rate (82.3%), while adult women had the lowest (22.8%). This supports the idea that gender strongly shaped survival outcomes. However, the results for children complicate the “women and children first” narrative: male children died at nearly twice the rate of female children (60.3% versus 30.9%). If the rule had been applied consistently, sex would not have mattered among children. This suggests that the rule was not applied evenly across all groups.
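A summary table of this shape can be built with a short dplyr pipeline. The sketch below is illustrative only: the data frame `titanic_toy` holds a few made-up rows, the column names `Sex`, `Age`, and `Survived` are assumed, and the under-18 cutoff for "Child" is an assumption rather than the definition used in the lab.

```r
library(dplyr)

# Toy stand-in for the passenger data (hypothetical rows, not the lab's data)
titanic_toy <- tibble(
  Sex      = c("male", "male", "male", "female", "female", "female", "male", "female"),
  Age      = c(34, 8, 45, 29, 6, NA, 15, 52),
  Survived = c(0, 1, 0, 1, 1, 1, 0, 0)
)

death_rates <- titanic_toy %>%
  select(Sex, Age, Survived) %>%                 # keep only the needed columns
  filter(!is.na(Age)) %>%                        # drop passengers with unknown age
  mutate(age_group = if_else(Age < 18, "Child", "Adult")) %>%  # assumed cutoff
  group_by(Sex, age_group) %>%
  summarize(
    passengers = n(),                            # group size
    deaths     = sum(Survived == 0),             # count of non-survivors
    death_rate = deaths / passengers,
    .groups    = "drop"
  ) %>%
  arrange(desc(death_rate))                      # highest death rate first

print(death_rates)
```

Sorting by `death_rate` puts the most affected group at the top, which matches the ordering of the table above.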
In Lab 5, a collaborative class lab, we analyzed the Youth Participatory Politics Survey Project data to compare average support for President Obama across racial groups. The visualization below shows group means with 95% confidence intervals.
The plot compares average support for President Obama across racial groups. The red points represent each group’s mean support, while the blue error bars show the 95% confidence interval around each mean. The horizontal black line represents the overall sample mean. This visualization makes it easier to compare each group’s average support against the overall average.
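A plot with these elements can be sketched in ggplot2 as follows. The data here are simulated, the column names `race` and `support` are hypothetical, and the interval uses a t-based formula, which may differ from the method used in class.

```r
library(dplyr)
library(ggplot2)

# Simulated stand-in for the survey data (not the real survey responses)
set.seed(1)
survey_toy <- tibble(
  race    = rep(c("Group A", "Group B", "Group C"), each = 30),
  support = c(rnorm(30, 3.0, 1), rnorm(30, 3.6, 1), rnorm(30, 4.1, 1))
)

# Group means with t-based 95% confidence intervals
group_means <- survey_toy %>%
  group_by(race) %>%
  summarize(
    n            = n(),
    mean_support = mean(support),
    se           = sd(support) / sqrt(n),
    .groups      = "drop"
  ) %>%
  mutate(
    ci_low  = mean_support - qt(0.975, df = n - 1) * se,
    ci_high = mean_support + qt(0.975, df = n - 1) * se
  )

overall_mean <- mean(survey_toy$support)

support_plot <- ggplot(group_means, aes(x = race, y = mean_support)) +
  geom_hline(yintercept = overall_mean, color = "black") +   # overall sample mean
  geom_errorbar(aes(ymin = ci_low, ymax = ci_high),
                color = "blue", width = 0.2) +               # 95% intervals
  geom_point(color = "red", size = 3) +                      # group means
  labs(x = "Group", y = "Mean support")

print(support_plot)
```

A group whose interval sits entirely above or below the black line differs visibly from the overall average, which is the comparison the plot is meant to support.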
In Lab 10, I created an interactive choropleth map showing Democratic vote share across East Coast states in the 2020 presidential election. The map uses darker shades of blue to represent higher Democratic vote share. Clicking on a state reveals its Democratic and Republican vote percentages and vote counts.
The map shows that Democratic vote share varied across the East Coast. Darker states had a higher Democratic vote percentage, while lighter states had lower Democratic vote share. The interactive popups make the visualization more useful because readers can inspect both percentages and raw vote counts for each state.
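The coloring and popup logic behind a map like this can be sketched with the leaflet package. To stay self-contained, this example draws made-up rectangular regions with `addRectangles()` instead of real state boundaries, and every vote count is a placeholder. A real choropleth would pass state polygons to `addPolygons()` with the same palette and popup arguments. The percentages here are two-party shares, which is a simplifying assumption.

```r
library(leaflet)

# Hypothetical regions with placeholder vote counts (not real election results)
regions_toy <- data.frame(
  name      = c("Region A", "Region B", "Region C"),
  lng1      = c(-80, -78, -76), lat1 = c(35, 35, 35),
  lng2      = c(-78, -76, -74), lat2 = c(37, 37, 37),
  dem_votes = c(520000, 410000, 300000),
  rep_votes = c(480000, 590000, 700000)
)

total <- regions_toy$dem_votes + regions_toy$rep_votes
regions_toy$dem_pct <- 100 * regions_toy$dem_votes / total
regions_toy$rep_pct <- 100 * regions_toy$rep_votes / total

# Sequential blue palette: higher Democratic share maps to a darker blue
pal <- colorNumeric(palette = "Blues", domain = regions_toy$dem_pct)

# HTML shown when a region is clicked
regions_toy$popup_text <- sprintf(
  "<b>%s</b><br>Democratic: %.1f%% (%s votes)<br>Republican: %.1f%% (%s votes)",
  regions_toy$name,
  regions_toy$dem_pct, formatC(regions_toy$dem_votes, format = "d", big.mark = ","),
  regions_toy$rep_pct, formatC(regions_toy$rep_votes, format = "d", big.mark = ",")
)

vote_map <- leaflet(regions_toy) %>%
  addTiles() %>%
  addRectangles(
    lng1 = ~lng1, lat1 = ~lat1, lng2 = ~lng2, lat2 = ~lat2,
    fillColor = ~pal(dem_pct), fillOpacity = 0.8,
    color = "white", weight = 1,
    popup = ~popup_text
  ) %>%
  addLegend(pal = pal, values = ~dem_pct, title = "Democratic vote share (%)")

vote_map
```

Keeping the palette in one `pal` object means the fill colors and the legend always agree.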
In Lab 11, I used machine learning models to predict Titanic survival using passenger characteristics such as sex, passenger class, age, family size, and fare. I compared a Generalized Linear Model, Random Forest, and AutoML models to evaluate which approach predicted survival most effectively.
The images below are included as saved plot files. To use this section, place your Lab 11 images in an images folder with the exact file names shown in the code chunks.
The GLM plots show that sex and passenger class were the strongest predictors of survival. Female passengers and first-class passengers were associated with higher predicted survival, while male passengers and third-class passengers were associated with lower predicted survival. Age and fare also contributed to the model, but they were weaker predictors than sex and class.
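To illustrate how a logistic GLM exposes the direction of each predictor, here is a minimal sketch using base R's `glm()` on simulated data. This is not the lab's model or data: the data frame `toy`, its column names, and the effect sizes used to simulate the outcome are all assumptions chosen so the pattern resembles the one described above.

```r
# Simulated passenger data (hypothetical, for illustration only)
set.seed(42)
n <- 400
toy <- data.frame(
  sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  pclass = factor(sample(1:3, n, replace = TRUE)),
  age    = round(runif(n, 1, 70)),
  fare   = round(rexp(n, rate = 1 / 30), 2)
)

# Assumed "true" effects used only to generate a survival outcome
linear_pred <- 1.2 -
  2.4 * (toy$sex == "male") -
  0.9 * (as.integer(toy$pclass) - 1) -
  0.02 * toy$age +
  0.005 * toy$fare
toy$survived <- rbinom(n, 1, plogis(linear_pred))

# Logistic regression: binomial family with the default logit link
fit <- glm(survived ~ sex + pclass + age + fare, data = toy, family = binomial)

# Coefficients are on the log-odds scale; exponentiate to get odds ratios
odds_ratios <- exp(coef(fit))
print(round(odds_ratios, 3))
```

An odds ratio below 1 (as for `sexmale` here) means lower odds of survival relative to the reference category, and a value above 1 means higher odds.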
The Partial Dependence Plots show how predicted survival changes as one predictor changes while the others are averaged out. The sex plot shows a much higher predicted survival response for female passengers than for male passengers. The passenger class plot shows higher predicted survival for first-class passengers compared to lower classes. The fare plot suggests that predicted survival generally increases as fare rises, though fare appears less direct than sex or passenger class because it overlaps with class.
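The averaging behind a partial dependence plot can be written out by hand. The sketch below uses simulated data and a base R `glm()` rather than the lab's models. The helper `partial_dependence()` is a hypothetical function written for this example, not part of any package.

```r
# Simulated data and model (hypothetical, for illustration only)
set.seed(42)
n <- 400
toy <- data.frame(
  sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  pclass = factor(sample(1:3, n, replace = TRUE)),
  fare   = round(rexp(n, rate = 1 / 30), 2)
)
toy$survived <- rbinom(
  n, 1,
  plogis(1 - 2.4 * (toy$sex == "male") - 0.9 * (as.integer(toy$pclass) - 1))
)
fit <- glm(survived ~ sex + pclass + fare, data = toy, family = binomial)

# For each candidate value: set the variable to that value for every row,
# predict, and average the predictions over the observed rows.
partial_dependence <- function(model, data, variable, values) {
  sapply(values, function(v) {
    modified <- data
    if (is.factor(data[[variable]])) {
      modified[[variable]] <- factor(rep(v, nrow(data)),
                                     levels = levels(data[[variable]]))
    } else {
      modified[[variable]] <- rep(v, nrow(data))
    }
    mean(predict(model, newdata = modified, type = "response"))
  })
}

pd_sex  <- partial_dependence(fit, toy, "sex", levels(toy$sex))
pd_fare <- partial_dependence(fit, toy, "fare", seq(0, 100, by = 25))
print(round(pd_sex, 3))
print(round(pd_fare, 3))
```

Because every other column keeps its observed values, each result is the average predicted survival if all passengers shared that one value, which is what "averaged out" means in the paragraph above.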
| model_id | auc | logloss | aucpr | mean_per_class_error | rmse | mse |
|---|---|---|---|---|---|---|
| StackedEnsemble_BestOfFamily_1_AutoML_1_20260515_155945 | 0.8705 | 0.4114 | 0.8433 | 0.1800 | 0.3563 | 0.1269 |
| GBM_4_AutoML_1_20260515_155945 | 0.8696 | 0.4135 | 0.8424 | 0.1788 | 0.3571 | 0.1275 |
| StackedEnsemble_AllModels_1_AutoML_1_20260515_155945 | 0.8655 | 0.4152 | 0.8416 | 0.1821 | 0.3575 | 0.1278 |
| GBM_grid_1_AutoML_1_20260515_155945_model_1 | 0.8615 | 0.4329 | 0.8355 | 0.2026 | 0.3682 | 0.1356 |
| GBM_2_AutoML_1_20260515_155945 | 0.8610 | 0.4213 | 0.8328 | 0.1904 | 0.3609 | 0.1303 |
| GBM_3_AutoML_1_20260515_155945 | 0.8606 | 0.4223 | 0.8377 | 0.1942 | 0.3621 | 0.1311 |
| GBM_grid_1_AutoML_1_20260515_155945_model_3 | 0.8601 | 0.4293 | 0.8305 | 0.1974 | 0.3672 | 0.1348 |
| GBM_5_AutoML_1_20260515_155945 | 0.8574 | 0.4172 | 0.8393 | 0.1877 | 0.3582 | 0.1283 |
| DeepLearning_grid_2_AutoML_1_20260515_155945_model_1 | 0.8568 | 0.4615 | 0.8245 | 0.1939 | 0.3739 | 0.1398 |
| GBM_grid_1_AutoML_1_20260515_155945_model_2 | 0.8562 | 0.4181 | 0.8305 | 0.1911 | 0.3594 | 0.1292 |
The GLM achieved an AUC of approximately 0.860, while the Random Forest achieved an AUC of approximately 0.892. Because AUC measures how well a model separates passengers who survived from those who did not, the Random Forest performed better on this metric.
The AutoML leaderboard also showed strong performance from ensemble and tree-based models. However, the GLM remains valuable because its coefficient plots make the direction of each predictor’s relationship with survival easier to explain.
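To make the AUC comparison concrete, the metric can be computed directly from ranks: it equals the probability that a randomly chosen survivor receives a higher predicted score than a randomly chosen non-survivor, with ties counted as one half. The function `auc_rank()` below is a hypothetical helper written for this illustration, and the labels and scores are made up.

```r
# AUC via the rank-sum (Mann-Whitney) formula
auc_rank <- function(actual, score) {
  r     <- rank(score)            # tied scores receive average ranks
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up labels (1 = survived) and predicted probabilities
actual <- c(0, 1, 0, 1)
score  <- c(0.1, 0.2, 0.8, 0.9)

auc_rank(actual, score)
# 0.75: of the four survivor/non-survivor pairs, three are ordered correctly
```

An AUC of 0.5 corresponds to random ordering and 1.0 to perfect separation, so both values reported above indicate models that rank survivors well above chance.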
Across these labs, I practiced moving from raw data to meaningful analysis. Lab 3 emphasized data wrangling and summary tables, Lab 5 focused on statistical visualization, Lab 10 introduced interactive mapping, and Lab 11 applied machine learning to prediction and interpretation. Together, these projects show how R can be used not only to produce graphs and models, but also to tell clearer stories with data.