SOCI523 R assignment

Author

Student ID: 201797565

Part 1 - Analysis of Visualization

Source: ‘Cigarette sales and lung cancer mortality in the US’ (Roser, 2021).

1.1 Introduction and Epidemiological Context

This visualization, published by Our World in Data, presents long-term trends in cigarette sales per adult and lung cancer mortality among men in the United States across the twentieth century. This figure highlights the pronounced epidemiological time lag of approximately three decades between population-level cigarette consumption and the later rise in lung cancer mortality (Roser, 2021).

This pattern aligns with established epidemiological research on the long-term dynamics of smoking-related harm. The chart effectively visualizes the “four-stage model of the cigarette epidemic” proposed by Lopez et al. (1994), in which peak mortality typically follows peak smoking prevalence within a population by some three to four decades. By annotating the chart with key historical interventions, such as the 1964 Surgeon General’s report and various federal tax increases, the visualization does not merely present data; it maps the history of public health policy. As Holford et al. (2014) have noted, such policy milestones were instrumental in curbing consumption, which eventually led to the decline in mortality depicted on the far right of the chart. In this sense, the visualization illustrates both the delayed consequences of tobacco consumption and the eventual efficacy of legislative public health interventions.

At the same time, this clarity depends on reducing a complex epidemiological process to two aligned curves, which simplifies social and biological dynamics that unfold over decades.

1.2 Technical Execution: Formatting Choices and Selective Representation

This visualization employs several design techniques that are both contested and effective. One of the most significant formatting choices is the use of dual Y-axes. The chart maps cigarette sales per adult per day on a purple axis (0-11) and the rate of lung cancer deaths per 100,000 men on a red axis (0-55). From a strict data visualization perspective, dual axes are frequently criticized. Cairo (2019) argues that dual axes can be highly misleading, as the chart creator can arbitrarily scale the axes to force lines to intersect or move in parallel, potentially implying spurious correlations. However, while dual axes remain contentious in information design, in this instance the approach arguably supports the figure’s explanatory aim. The goal here is not to compare the absolute magnitude of sales versus deaths, which are fundamentally incommensurable units, but rather to compare the shape of their trends and the temporal displacement between them. If these two variables were separated into two distinct charts, the visceral impact of seeing the “mortality curve” mirroring the “sales curve” 30 years later would be significantly reduced.
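The scaling concern can be made concrete with a minimal ggplot2 sketch. It assumes only the two axis ranges read off the chart (0-11 and 0-55), which imply a conversion factor of 5. The two series are synthetic bell-shaped curves offset by 30 years, invented purely for illustration; they are not the Our World in Data values, and all object names are hypothetical.

```r
library(ggplot2)

# Synthetic, illustrative series only (NOT the Our World in Data values):
# a "sales" bump peaking in 1960 and a "mortality" bump peaking 30 years later.
years <- 1900:2010
toy <- data.frame(
  year   = years,
  sales  = 10 * exp(-((years - 1960) / 25)^2),  # cigarettes per adult per day
  deaths = 50 * exp(-((years - 1990) / 25)^2)   # deaths per 100,000 men
)

# Axis ranges as printed on the chart: 0-11 (left) and 0-55 (right).
# Their ratio decides how the right-hand series is squeezed onto the left axis.
axis_ratio <- 55 / 11

dual_axis_plot <- ggplot(toy, aes(x = year)) +
  geom_line(aes(y = sales), colour = "purple") +
  # The secondary series is divided by the ratio so it shares the primary scale
  geom_line(aes(y = deaths / axis_ratio), colour = "red") +
  scale_y_continuous(
    name = "Cigarettes sold per adult per day",
    limits = c(0, 11),
    # sec_axis() multiplies back so the right-hand labels read in original units
    sec.axis = sec_axis(~ . * axis_ratio, name = "Lung cancer deaths per 100,000 men")
  ) +
  labs(x = NULL)

dual_axis_plot
```

Changing `axis_ratio` stretches or flattens the red curve relative to the purple one without touching the data, which is the behaviour Cairo (2019) warns about. The horizontal offset between the two peaks is unaffected by the ratio, and that offset is the feature this chart is designed to show.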

Furthermore, the chart excels in its cognitive design. Following the principles of visual perception outlined by Ware (2022), it utilizes direct color mapping (purple line to purple axis text, red line to red axis text) based on Gestalt principles of similarity, which significantly reduces the reader’s cognitive load. Additionally, the inclusion of a dense annotation layer transforms abstract statistical lines into a coherent narrative. Instead of requiring the reader to cross-reference historical dates, the author embeds the historical context directly into the data space, adhering to best practices of data storytelling (Smith, 2022).

Beyond the mechanical formatting of the axes, a further critique concerns selective representation. A close inspection of the labels reveals a mismatch in denominators: the purple sales curve represents consumption “per adult” (all adults, regardless of sex), while the red mortality curve isolates deaths “per 100,000 men” (Roser, 2021). Pairing aggregate consumption with male-specific mortality produces a cleaner visual alignment than a gender-disaggregated comparison might allow. However, this design choice narrows the demographic context of the narrative, rendering women’s mortality less visible within the visual frame. From the perspective advanced by D’Ignazio and Klein (2020), such simplifications illustrate how data visualizations can privilege clarity and narrative force over representational completeness.

1.3 Conclusion: Historical Precedents and Contemporary Implications

This visualization represents a balance between technical design and narrative clarity, despite certain limitations in its gendered data representation. While the dual-axis format demands reader caution, it effectively translates complex epidemiological phenomena into an accessible visual story about the delayed but fatal consequences of smoking.

More broadly, the sociological significance of latency effects remains highly relevant for contemporary public health governance. The visualization reminds us that behavioural risk factors often generate consequences that unfold over decades, complicating political decision-making and public perception. In this sense, the chart functions not only as a historical record but also as a reflection on the temporal dynamics of risk, responsibility, and intervention.

References

Cairo, A. (2019) How Charts Lie: Getting Smarter about Visual Information. New York: W.W. Norton.

D’Ignazio, C. and Klein, L.F. (2020) Data Feminism. Cambridge: The MIT Press.

Holford, T.R., Meza, R., Warner, K.E., et al. (2014) ‘Tobacco control and the reduction in smoking-related premature deaths in the United States, 1964–2012’, JAMA, 311(2), pp. 164–171.

Lopez, A.D., Collishaw, N.E. and Piha, T. (1994) ‘A descriptive model of the cigarette epidemic in developed countries’, Tobacco Control, 3(3), pp. 242–247.

Roser, M. (2021) Smoking. Our World in Data. Available at: https://ourworldindata.org/smoking (Accessed: 24 April 2026).

Smith, A. (2022) How charts work: understand and explain data with confidence. Harlow: Pearson Education.

Ware, C. (2022) Visual thinking for information design. Cambridge: Elsevier.

Part 2 - A Decadal Analysis of E-Cigarette Transition and User Profiles (2010-2024)

2.1 Introduction

Over the past two decades, the global landscape of nicotine consumption has undergone a structural transformation. As the prevalence of combustible cigarettes declines in response to public health interventions, electronic cigarettes (e-cigarettes) have rapidly emerged as a prominent alternative. This shift reflects not only product substitution but also changes in patterns of nicotine use and dependence. Understanding these dynamics requires attention to both harm reduction among existing smokers and the potential emergence of new forms of nicotine use (Samet and Barrington-Trimis, 2021).

This report draws on National Health Interview Survey (NHIS) data from 2010 to 2024 to examine three dimensions of this transition: temporal trends in nicotine use, decadal changes in user populations, and the demographic profile of contemporary users.

2.2 Data Sources & Methodology

Data for this analysis were drawn from the IPUMS Health Surveys: National Health Interview Survey (NHIS) (Blewett et al., 2023), restricted to adult respondents surveyed between 2010 and 2024. Nicotine use patterns were classified using binary indicators for combustible cigarette and e-cigarette use. Based on these indicators, respondents were grouped into three categories: tobacco-only users, e-cigarette-only users, and dual users.

Survey weights (SAMPWEIGHT) were applied to account for the survey design and differential non-response, ensuring that all estimates are representative of the U.S. adult population. For the age distribution analysis in Graph 3, weighted terciles were calculated using the Hmisc package (Harrell, 2024). These thresholds divide the weighted population of e-cigarette users into three equal groups and summarise variation across age.
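To illustrate what a weighted tercile means, the following base R sketch finds the smallest age at which the cumulative share of survey weight reaches one third and two thirds. The function `weighted_quantile()` and the toy ages and weights are hypothetical. The definition is a simplified one, so Hmisc::wtd.quantile(), which interpolates, may return slightly different values on real data.

```r
# Weighted quantile via the cumulative share of survey weight.
# Simplified definition for illustration; not the Hmisc algorithm.
weighted_quantile <- function(x, w, probs) {
  ord <- order(x)
  x <- x[ord]
  w <- w[ord]
  cum_share <- cumsum(w) / sum(w)
  # Smallest value whose cumulative weight share reaches each probability
  vapply(probs, function(p) x[which(cum_share >= p)[1]], numeric(1))
}

# Hypothetical respondents: ages and survey weights are made up
age    <- c(19, 22, 25, 31, 40, 58)
weight <- c(3000, 3000, 3000, 1000, 1000, 1000)

# Weighted terciles: the younger respondents each stand for more people,
# so the cut points move towards younger ages (here 22 and 25)
weighted_quantile(age, weight, probs = c(1/3, 2/3))

# Unweighted terciles of the same ages, for comparison
quantile(age, probs = c(1/3, 2/3), type = 1)
```

The comparison shows why weights matter: respondents who represent more people pull the thresholds towards their own ages, which an unweighted quantile ignores.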

Data Preparation: Cleaning and Translation

library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)  # ggplot() and geoms used for Graphs 2 and 3
library(scales)   # percent() used in the Graph 2 labels

# Import the raw NHIS dataset
raw_smoke_status <- read_csv("us_smoke_status_2010-2024.csv")

# Clean and translate data based on the survey code book
smoker_fact <- raw_smoke_status |>
  # Exclude rows with invalid or missing smoking/vaping status
  filter(!SMOKESTATUS2 %in% c(40, 90)) |>
  filter(!ECIGED %in% c(7, 8, 9)) |>
  # Create logical flags and define User Group categories
  mutate(
    # %in% already returns TRUE/FALSE, so no ifelse() wrapper is needed
    is_tobacco = SMOKESTATUS2 %in% c(10, 11, 12, 13),
    is_ecig = ECIGED %in% c(1, 2),
    # Logical mapping: Assign users into three groups
    User_Group = case_when(
      is_tobacco & is_ecig  ~ "Dual Users",
      is_tobacco & !is_ecig ~ "Tobacco Only",
      !is_tobacco & is_ecig ~ "E-cigarette Only",
      TRUE                  ~ NA_character_),
    # Gender translation: Convert numeric codes to labels
    Gender = case_when(
      SEX == 1 ~ "Male",
      SEX == 2 ~ "Female",
      TRUE     ~ NA_character_)) |>
  # Remove rows outside of defined groups
  drop_na(User_Group)

2.3 Analysis of Visualisations

Graph 2: Decadal Shift in Nicotine Product Users (2014–2024)

Graph 2 examines the net change in the number of nicotine users between 2014 and 2024. Survey weights are summed to express each user group as a population estimate in millions, which clarifies the absolute scale of change across groups.

# --- Prepare data for Graph 2 ---
plot2_df1 <- smoker_fact |>
  filter(YEAR >= 2014) |>
  group_by(YEAR, User_Group) |>
  # Estimate weighted population in millions
  summarise(smoker_pop_millions = sum(SAMPWEIGHT) / 1e6, .groups = "drop") |>
  group_by(YEAR) |>
  # Calculate group proportion within each year
  mutate(prop = smoker_pop_millions / sum(smoker_pop_millions)) |>
  ungroup()

plot2_df2 <- plot2_df1 |>
  # Compare 10-year volume changes across user categories (2014 vs 2024)
  filter(YEAR %in% c(2014, 2024)) |>
  select(YEAR, User_Group, smoker_pop_millions) |>
  # Reshape to wide format for year-to-year comparison
  pivot_wider(names_from = YEAR, names_prefix = "yr",
              values_from = smoker_pop_millions) |>
  # Calculate 10-year net change and growth rate
  mutate(net_change = yr2024 - yr2014,
         growth_rate = net_change / yr2014)

plot2_df3 <- plot2_df2 |>
  # Summarise metrics for the "Total User" category
  summarise( yr2014 = sum(yr2014),
             yr2024 = sum(yr2024),
             net_change = yr2024 - yr2014,
             growth_rate = net_change / yr2014,
             User_Group = "Total User")

# Merge group and total data
plot2_data <- bind_rows(plot2_df2, plot2_df3) |>
  mutate(User_Group = factor(User_Group, 
         levels = c("Tobacco Only", "E-cigarette Only", "Dual Users", "Total User")))
# --- Visualization for Graph 2 ---
Graph2 <-
  ggplot(plot2_data, aes(x = User_Group, y = net_change, fill = User_Group)) +
  # Represent user fluctuations using net change volume
  # (geom_col() is the idiomatic form of geom_bar(stat = "identity"))
  geom_col(width = 0.6, alpha = 0.9, color = "white", linewidth = 0.5) +
  # Add zero baseline for growth/decline reference
  geom_hline(yintercept = 0, color = "black", linewidth = 0.5) +
  # Add labels with signs and growth rates
  geom_text(aes(label = paste0(ifelse(net_change > 0, "+", ""),
                               round(net_change, 1), " M\n", "(",
                               percent(growth_rate, accuracy = 1), ")")), 
            # Adjust label position for visual clarity and aesthetics
            vjust = ifelse(plot2_data$net_change > 0, -0.4, 1.2),
            lineheight = 1, fontface = "bold", size = 3.5) +
  
  # Manual color palette for user categories
  scale_fill_manual(values = c(
    "Tobacco Only" = "#ff7f00", 
    "E-cigarette Only"  = "#6baed6", 
    "Dual Users"   = "#807dba", 
    "Total User"   = "#848584FF")) +
  
  # Format Y-axis and expand space for labels
  scale_y_continuous(
    labels = function(x) paste0(x, " M"),
    expand = expansion(mult = c(0.25, 0.25))) +
  
  # Name the plot headers and axis titles
  labs(title = "Graph 2: Decadal Shift in Nicotine Product Users",
       subtitle = "Net Change and Growth Rate (2014-2024)",
       x = "", 
       y = "Net Change (Millions)" ) +
  
  # Overall theme and aesthetic adjustments
  theme_minimal() +
  theme(legend.position = "none",
        text = element_text(family = "serif"),
        plot.title = element_text(face = "bold", size = 15),
        # Clean up grid lines
        panel.grid.major.x = element_blank(),
        panel.grid.minor = element_blank())

Graph2

The chart reveals a clear divergence across user groups. The “Tobacco Only” population declined by approximately 13.0 million users (−39%), while the “E-cigarette Only” group expanded rapidly, increasing by 10.7 million (+413%). As a result, the total number of nicotine users declined only modestly, by around 4.1 million (−10%).
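Because the “Total User” bar is defined in the code as the sum of the three groups, the reported figures can be cross-checked with simple arithmetic. The sketch below uses only the rounded values quoted above, so its results are indicative rather than exact; the object names are hypothetical.

```r
# Rounded figures as reported in the text (millions of adults)
net_change <- c("Tobacco Only" = -13.0, "E-cigarette Only" = 10.7, "Total User" = -4.1)
growth     <- c("Tobacco Only" = -0.39, "E-cigarette Only" = 4.13, "Total User" = -0.10)

# The three groups sum to the total, so the dual-user change is the remainder
dual_change <- net_change[["Total User"]] -
  (net_change[["Tobacco Only"]] + net_change[["E-cigarette Only"]])
round(dual_change, 1)

# Implied 2014 baselines: net_change / growth_rate
# (approximate, because both inputs are already rounded)
baseline_2014 <- net_change / growth
round(baseline_2014, 1)
```

Under these rounded inputs the remainder attributable to dual users comes to roughly −1.8 million, which is consistent with the total falling by more than the combined change in the two single-product groups.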

These findings indicate that reductions in combustible cigarette use have not translated into a proportional decline in nicotine dependence. Instead, the data point to a process of structural substitution, in which declining tobacco use is offset by the uptake of alternative nicotine products. Substitution is not always complete: research on dual users indicates that e-cigarettes are often used alongside, rather than fully replacing, combustible cigarettes (Rass et al., 2015).

Graph 3: Age Density Profile of E-Cigarette Users

Graph 3 presents the age distribution of “E-cigarette Only” users in 2024, using weighted density plots disaggregated by gender. The inclusion of tercile markers allows for a structured comparison of age profiles across groups.
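As a package-independent sketch of what a weighted density does, base R's density() accepts a `weights` argument that must sum to 1. The ages, weights and bandwidth below are invented for illustration, and this is not the ggridges computation used to draw the graph.

```r
# Made-up ages and survey weights, for illustration only
age    <- c(19, 21, 22, 24, 27, 30, 35, 42, 55, 63)
weight <- c(4, 4, 4, 3, 3, 2, 2, 1, 1, 1) * 1000

# density() expects weights that sum to 1; bw = 5 is an arbitrary bandwidth
unweighted <- density(age, bw = 5, from = 18, to = 80)
weighted   <- density(age, weights = weight / sum(weight), bw = 5, from = 18, to = 80)

# Age at which each curve peaks
unweighted$x[which.max(unweighted$y)]
weighted$x[which.max(weighted$y)]
```

Respondents with larger weights contribute more to the curve, so the weighted estimate describes the age profile of the population rather than of the sample.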

# --- Prepare data for Graph 3 ---

# Requires the Hmisc package: run install.packages("Hmisc") once in the console

# Extract 2024 e-cigarette users and clean gender/age data
plot3_df1_ecig <- smoker_fact |>
  filter(YEAR == 2024, User_Group == "E-cigarette Only") |>
  filter(!is.na(AGE), !is.na(Gender)) 

# Create comparison categories to examine gender disparities in usage
plot3_df2_t <- plot3_df1_ecig |> mutate(Category = "Total Users")        
plot3_df3_f <- plot3_df1_ecig |> filter(Gender == "Female") |> 
  mutate(Category = "Female") 
plot3_df4_m <- plot3_df1_ecig |> filter(Gender == "Male") |> 
  mutate(Category = "Male")

# Merge gender and total data for disparity analysis
plot3_data <- bind_rows(plot3_df2_t, plot3_df4_m, plot3_df3_f) |>
  mutate(Category = factor(Category, levels = c("Female", "Male", 
                                                "Total Users")))

# Calculate weighted terciles to analyze gender-age group disparities
plot3_df5_ter <- plot3_data |>
  group_by(Category) |>
  # Get 33% and 67% age points using weighted logic
  # (called as Hmisc:: so the package need not be attached, which also
  # avoids masking dplyr functions)
  reframe(age_val = Hmisc::wtd.quantile(AGE, weights = SAMPWEIGHT,
                                        probs = c(1/3, 2/3)))

# --- Visualization for Graph 3 ---
# Load ggridges to visualize age distributions and comparative density across groups
library(ggridges)

Graph3 <- 
ggplot(plot3_data, aes(x = AGE, y = Category, 
                       fill = Category, weight = SAMPWEIGHT)) +
  # Plot density ridges to locate age peaks
  stat_density_ridges(quantile_lines = TRUE, quantiles = 3, 
                      color = "gray30",alpha = 0.65, scale = 1.2,
                      linewidth = 0.5, rel_min_height = 0.005,
                      vline_linetype = "dashed", vline_width = 0.6) +
  
  # Add numeric labels for terciles
  geom_text(data = plot3_df5_ter, 
            aes(x = age_val, y = Category, label = round(age_val, 0)),
            # Prevent fill/weight conflicts
            inherit.aes = FALSE,
            vjust = -0.8, size = 3.5, family = "serif") +
  
  # Manual color palette for gender and total categories
  scale_fill_manual(values = c("Total Users" = "#848584FF", 
                               "Male" = "#56B4E9",  
                               "Female" = "#ffb703")) +
  
  # Set X-axis range for adult population (18-80); note that scale limits drop
  # any respondents older than 80 before the density is estimated
  scale_x_continuous(breaks = seq(18, 80, by = 10), limits = c(18, 80)) +
  
  labs(title = "Graph 3: Age and Gender Patterns of E-Cigarette Users (2024)",
       caption = "Note: Vertical lines denote terciles, dividing users into equal thirds.",
       x = "Age", y = "") +
  
  # Overall theme and aesthetic adjustments
  theme_minimal() +
  theme(legend.position = "none",
        text = element_text(family = "serif"), 
        plot.title = element_text(face = "bold", size = 15),
        plot.caption = element_text(face = "italic", 
                                    size = 10, hjust = 0, color = "gray30"),
        axis.text.y = element_text(face = "bold", size = 12),
        panel.grid.minor = element_blank())

Graph3

The distribution shows a clear concentration of users in younger age groups, with the lower tercile falling in the mid-twenties (approximately age 25–26). This supports prior research indicating that younger adults often perceive e-cigarettes as modern and socially acceptable consumer products, rather than as traditional tobacco (Case et al., 2016).

A notable feature of the graph is the strong overlap between male and female distributions. The curves are highly similar in shape, indicating minimal gender difference in the age profile of use. This contrasts with historically gendered patterns of cigarette smoking.

This convergence aligns with emerging evidence of increasing uptake among women and a narrowing gender gap in younger populations (Kurti, 2020; Bleifertaitė et al., 2024). Taken together, the findings suggest that e-cigarette use is becoming more evenly distributed across genders, particularly among younger cohorts, pointing toward a more generalised pattern of nicotine uptake.

2.4 Conclusion

This analysis highlights a transformation in nicotine consumption patterns over the past decade. While traditional cigarette use has declined substantially, this reduction has only partially translated into a decrease in overall nicotine use. Instead, e-cigarettes have emerged as a major alternative, reshaping rather than reducing population-level dependence.

Across the three visualisations, a consistent pattern emerges: traditional cigarette use is decreasing, but this decline is partially offset by the rapid expansion of e-cigarette use and the persistence of dual use. As a result, nicotine dependence has not disappeared but has been redistributed across different forms of consumption, and e-cigarette use now shows a similar age profile for men and women, concentrated among younger adults.

Taken together, these findings suggest that recent trends reflect a process of structural transition rather than a complete reduction in nicotine use, highlighting the limits of interpreting declining smoking rates as a straightforward public health success.

References

Bleifertaitė, R., Leonavičiūtė-Klimantavičienė, M. and Petraitienė, S. (2024) ‘Understanding gender differences in e-cigarette use among adolescents: survey findings’, Journal of Public Health. Advance online publication.

Blewett, L.A., Rivera Drew, J.A., King, M.L., Williams, K.C., Delesevic, A. and Ruggles, S. (2023) IPUMS Health Surveys: National Health Interview Survey, Version 7.3 [data set]. Minneapolis, MN: IPUMS.

Case, K.R., Clendennen, S.L., Kelder, S.H., Crook, B., Perez, A. and Wilkinson, A.V. (2016) ‘Formative research to identify perceptions of e-cigarettes in college students: Implications for future health communication campaigns’, Journal of American College Health, 64(5), pp. 380–389.

Harrell, F.E. (2024) Hmisc: Harrell Miscellaneous. R package version 5.1-2. Available at: https://CRAN.R-project.org/package=Hmisc (Accessed: 10 April 2026).

Kurti, A.N. (2020) ‘Reducing tobacco use among women of childbearing age: Contributions of tobacco regulatory science and tobacco control’, Experimental and Clinical Psychopharmacology, 28(5), pp. 501–516.

Rass, O., Pacek, L.R., Johnson, P.S. and Johnson, M.W. (2015) ‘Characterizing use patterns and perceptions of relative harm in dual users of electronic and tobacco cigarettes’, Experimental and Clinical Psychopharmacology, 23(6), pp. 494–503.

Samet, J.M. and Barrington-Trimis, J. (2021) ‘E-cigarettes and harm reduction: An artificial controversy instead of evidence and a well-framed decision context’, American Journal of Public Health, 111(9), pp. 1572–1574.

Part 3 - Consideration of AI Responses

The following section documents the interaction with ChatGPT aimed at improving the baseline R code. The original script utilized in this exercise is sourced from the second visualization, ‘Organs’, provided on the Week 9 Canvas page for SOCI523.

3.1 Interaction Transcript

==PASTED TEXT BEGINS HERE==

  • Me: Hello. I am working on an assignment about data visualization using R, and I have a baseline code that creates a dot plot for organ donation rates using the socviz dataset. I would like you to improve the code based on the following Requests:
  1. Axis Clarity: Please update the x-axis label to “Donor Rate (PMP: Per Million Population)” so the scale is clearly defined.
  2. Your Improvement: please provide one additional optimization (e.g., aesthetic, functional, or analytical/narrative) that you think would make this plot for a sociology report, and explain why you chose to add it.
  • AI (ChatGPT): What I changed and why:
  1. Axis clarity (your request): Updated x = “Donor Rate (PMP: Per Million Population)”. In sociology and public health reporting, clarity of measurement is essential. Adding “PMP” ensures the reader immediately understands the unit.
  2. Additional Improvement (Analytical): Added a vertical reference line for the overall mean using geom_vline. Why this improves the plot: It provides a visual benchmark. Without it, it’s harder to interpret how high or low a country’s rate really is. It strengthens the narrative argument by visually reinforcing the claim that presumed consent leads to higher donation rates (clustering relative to the mean).
  3. Optional further refinement: Use annotations to highlight key countries (e.g., Spain).

==PASTED TEXT ENDS HERE==

3.2 Critical Reflection on AI Performance

To evaluate the AI’s utility in sociological data visualization, I provided a baseline R script alongside two specific requests: one requiring an explicit technical code modification, and the other an open-ended prompt encouraging an “aesthetic, functional, or analytical/narrative” improvement.

The AI successfully executed the explicit technical instruction by relabelling the x-axis as “Donor Rate (PMP: Per Million Population)”. As it correctly noted, clarity of measurement is essential in sociology and public health reporting, and stating the unit removes ambiguity about what the raw numbers represent.

In responding to the open-ended request, the AI prioritized an analytical intervention. By adding a vertical reference line (geom_vline) at the overall mean, it turned the dot plot from a simple ranking into a comparison against a benchmark. Its justification, that the line makes it easier to judge how high or low a country’s rate is and reinforces the claim that presumed consent is associated with higher donation rates, shows an awareness of how visual heuristics function in sociological arguments. Its wording that presumed consent “leads to” higher rates is, however, a causal claim that a reference line alone cannot establish.
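For readers unfamiliar with the technique, a mean reference line of this kind might look like the sketch below. It is not the code ChatGPT returned, and it does not use the socviz data: the data frame `toy_donors`, its column names and its values are all invented for illustration.

```r
library(ggplot2)

# Hypothetical donor rates (per million population); values are invented
toy_donors <- data.frame(
  country = c("A", "B", "C", "D", "E"),
  consent = c("Presumed", "Informed", "Presumed", "Informed", "Presumed"),
  donors  = c(28, 14, 21, 17, 12)
)

# Benchmark: the unweighted mean across all countries in the toy data
overall_mean <- mean(toy_donors$donors)

mean_line_plot <- ggplot(toy_donors,
                         aes(x = donors, y = reorder(country, donors), colour = consent)) +
  geom_point(size = 3) +
  # Dashed vertical line marking the overall mean
  geom_vline(xintercept = overall_mean, linetype = "dashed", colour = "gray40") +
  labs(x = "Donor Rate (PMP: Per Million Population)", y = NULL, colour = "Consent law")

mean_line_plot
```

In this made-up example one presumed-consent country falls below the mean, which shows that the line is a descriptive benchmark: it helps the reader locate each country but does not by itself demonstrate an effect of the consent regime.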

Furthermore, the AI demonstrated narrative awareness to some degree. In its optional suggestions, it independently identified the necessity of highlighting “key countries (e.g., Spain)” using shapes or annotations. Even without explicit background information provided, the AI identified Spain as the most extreme case in the dataset. However, from a critical perspective, this might be a default algorithmic tendency, such as a programmed heuristic to highlight maximum or minimum values, rather than genuine contextual comprehension.

In conclusion, this interaction illustrates that AI excels at executing explicit technical and coding modifications. Yet, when addressing narrative-related demands, its perception remains relatively rudimentary. Bridging the gap between raw statistical data and persuasive policy communication ultimately requires human researchers to define and finalize the narrative focus.