adult tobacco consumption in the U.S.

Author

Ayomide Joe-Adigwe

Introduction

Data Cleaning

# Checking for missing values
sum(is.na(Adult_Tobacco_Consumption_In_The_U_S_2000_Present))
[1] 0
# Removing rows with missing values
Adult_Tobacco_Consumption_In_The_U_S_2000_Present  <- na.omit(Adult_Tobacco_Consumption_In_The_U_S_2000_Present )

# Filter for relevant years (e.g., 2000 onwards)
Adult_Tobacco_Consumption_In_The_U_S_2000_Present  <- Adult_Tobacco_Consumption_In_The_U_S_2000_Present  %>% filter(Year >= 2000)

# Convert any necessary columns to appropriate data types
Adult_Tobacco_Consumption_In_The_U_S_2000_Present $Year <- as.numeric(Adult_Tobacco_Consumption_In_The_U_S_2000_Present $Year)

Linear Regression Analysis:

# Linear Regression: Predicting 'Total Per Capita' based on 'Year' and 'Total'
model <- lm(`Total Per Capita` ~ Year + Total, data = Adult_Tobacco_Consumption_In_The_U_S_2000_Present)

# Display model summary
model_summary <- summary(model)
model_summary

Call:
lm(formula = `Total Per Capita` ~ Year + Total, data = Adult_Tobacco_Consumption_In_The_U_S_2000_Present)

Residuals:
   Min     1Q Median     3Q    Max 
-97.48 -11.27   0.96  14.73 163.05 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.171e+03  5.576e+02   7.481 7.73e-13 ***
Year        -2.076e+00  2.771e-01  -7.489 7.32e-13 ***
Total        4.360e-09  1.638e-11 266.223  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 33.75 on 309 degrees of freedom
Multiple R-squared:  0.9957,    Adjusted R-squared:  0.9957 
F-statistic: 3.593e+04 on 2 and 309 DF,  p-value: < 2.2e-16
# Extract p-values and adjusted R-squared
p_values <- coef(summary(model))[, "Pr(>|t|)"]
adjusted_r2 <- model_summary$adj.r.squared

# Display p-values and adjusted R-squared for analysis
p_values
 (Intercept)         Year        Total 
7.728110e-13 7.319974e-13 0.000000e+00 
adjusted_r2
[1] 0.9956911
# Regression equation
equation <- paste0("Total Per Capita = ", round(coef(model)[1], 2), 
                   " + ", round(coef(model)[2], 2), "*Year",
                   " + ", round(coef(model)[3], 2), "*Total")
equation
[1] "Total Per Capita = 4170.86 + -2.08*Year + 0*Total"

Diagnostic Plots

# Diagnostic Plots
# To avoid margin issues, reset plot layout and plot each diagnostic plot separately

# Plot: Residuals vs Fitted
par(mfrow = c(1, 1), mar = c(5, 5, 2, 2))  # Adjust margins
plot(model, which = 1)

# Plot: Normal Q-Q
plot(model, which = 2)

# Plot: Scale-Location (Homoscedasticity Check)
plot(model, which = 3)

# Plot: Residuals vs Leverage
plot(model, which = 5)

Create Scatterplot

# Create Heatmap
heatmap_plot <- ggplot(Adult_Tobacco_Consumption_In_The_U_S_2000_Present, aes(x = Year, y = Measure, fill = `Total Per Capita`)) +
  geom_tile() +
  labs(
    title = "Heatmap of Total Tobacco Consumption per Capita Over Time",
    x = "Year",
    y = "Type of Tobacco Product (Measure)",
    fill = "Total Per Capita",
    caption = "Data Source: CDC"
  ) +
  scale_fill_gradient(low = "lightblue", high = "darkblue") +  # Adjust the color scale
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5),
    legend.position = "right"
  )

heatmap_plot

Conclusion and Analysis

a. How the Data Was Cleaned

The dataset was cleaned using several steps to ensure that the data was accurate and relevant for analysis. First, any missing data points were identified and removed using the na.omit() function. This ensured that no incomplete rows were present, which could have affected the analysis.

b. Visualization Interpretation

The primary visualization in this project was a heatmap, which illustrated the changes in Total Tobacco Consumption per Capita across different years and types of tobacco products (the Measure variable). The heatmap used color intensity to show higher or lower levels of consumption.

Several insights emerged from the heatmap:

  • General Trends: The heatmap revealed a noticeable decline in per capita tobacco consumption for several tobacco product categories over the years, particularly for combustible tobacco products like cigarettes.

  • Variation by Product: While some types of tobacco (e.g., cigarettes) showed a steady decline, non-combustible products like smokeless tobacco demonstrated more stable consumption patterns. This suggests that as smoking rates decrease, other tobacco products might be gaining popularity.

  • Unexpected Findings: In certain years, there was a sudden increase in consumption for specific products such as cigars, which could be attributed to changes in consumer behavior or product marketing.

The heatmap’s ability to visually represent multiple variables (year, product type, and consumption level) helped to uncover these patterns and allowed for easy comparison between product categories.

c. Challenges and Limitations