Smoking Ban Policies and County-level Economic factors

Data 110 Final Project

Introduction

Topic Overview

Smoking bans are policies that regulate or prohibit smoking in public or private spaces to promote public health. These policies aim to reduce exposure to secondhand smoke, discourage smoking, and improve overall community health. While the health benefits of smoking bans are well documented, their relationship to broader socioeconomic factors, such as income, education, and demographic disparities, is less understood. My goal in this project is to uncover relationships between smoking bans and county-level economic and demographic factors and to identify potential patterns.

Data Source

The dataset, “County-Level Smoking Ban Dataset,” was collected from http://quickfacts.census.gov/qfd/states/ (no longer available) and obtained through a publicly available data-sharing repository. It does not include a ReadMe file, so the exact context for the creation of this dataset is unknown.

The potential reasons for collecting this data likely include:

Evaluating the impact of smoking bans on public health outcomes, such as smoking rates and the prevalence of smoking-related diseases.
Analyzing economic effects, including the sales of smoking-related products and their relationship to poverty or income levels.
Studying demographic disparities, such as differences in smoking behavior or education levels, across counties with varying smoking bans.

Methodology

While no detailed methodology was provided, certain inferences can be made about how the data was collected:

Socioeconomic and Demographic Data:

Variables such as median household income, poverty rates, and educational attainment were likely derived from census data or national surveys conducted by federal agencies.

Policy Documentation:

Information about smoking bans was probably sourced from government ordinances, public health regulations, or state- and county-level legislative records. This data likely includes the type and extent of smoking bans implemented in each county.

Why This Topic Matters

Smoking bans are an important public health measure, but their socioeconomic implications are often overlooked. This dataset offers an opportunity to explore these connections, which makes it personally meaningful because it matches my interest in analyzing data. Understanding the broader impact of smoking bans on communities can inform future policy decisions and help ensure equitable outcomes across different demographic and socioeconomic groups.

Key Variables

The Dataset includes the following key variables:

Categorical Variables:

smoking_ban: Type of smoking ban policy (none, partial, comprehensive).
state: State name.
education_group: Grouped education attainment levels based on bachelor’s degree percentages.

Quantitative Variables:

sales_per_capita: Average sales per capita for smoking-related products (USD).
median_household_income: Median income at the county level (USD).
poverty: Poverty percentage at the county level.
bachelors: Percentage of the population with a bachelor’s degree.
density: Population density (people per square mile).
age_under_18: Percentage of the population under 18 years old.
age_over_65: Percentage of the population over 65 years old.

Preparation

To analyze the dataset effectively, I performed the following steps:

Selected only relevant columns from the original dataset such as smoking bans, socioeconomic indicators, and demographic data.
Cleaned column names by converting them to lowercase and replacing spaces with underscores for consistency.
Checked for and addressed missing values to maintain a clean analysis.
Summarized key economic and demographic variables across different smoking ban types.
Examined initial patterns and relationships between variables such as median household income, poverty rates, and smoking-related sales per capita.
Grouped and categorized data, such as creating education groups based on the percentage of the population with bachelor’s degrees. This allowed me to explore more deeply how education might correlate with smoking-related economic outcomes.

Objective

The main objective of this project is to uncover potential relationships between smoking bans and county-level economic and demographic factors. I want to explore not just the health implications of smoking bans but also their socioeconomic dimensions, which aligns with my interest in analyzing data to address real-world challenges and find patterns in public health and policy data. Specifically, the project aims to answer the following questions:

Impact of Smoking Bans:

How do smoking ban policies (none, partial, comprehensive) correlate with economic factors such as median household income, poverty rates, and sales of smoking-related products?

Socioeconomic Disparities:

Are counties with different types of smoking bans (e.g., partial) associated with varying levels of education, population density, or household income?

Cleaning 1.0:

setwd("/Users/ronaldohernandez/Desktop/Data 110 folder/Final Project")

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# Load dataset
meow <- readr::read_csv("county_w_sm_banfinaldataset.csv")

## Rows: 5999 Columns: 54
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (3): name, state, smoking_ban
## dbl (51): fips, pop2010, pop2000, age_under_5, age_under_18, age_over_65, fe...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Inspect dataset structure and first few rows
glimpse(meow)

## Rows: 5,999
## Columns: 54
## $ name                                      <chr> "Abbeville County", "Acadia …
## $ state                                     <chr> "South Carolina", "Louisiana…
## $ fips                                      <dbl> 45001, 22001, 51001, 16001, …
## $ pop2010                                   <dbl> 25417, 61773, 33164, 392365,…
## $ pop2000                                   <dbl> 26167, 58861, 38305, 300904,…
## $ age_under_5                               <dbl> 6.0, 7.6, 5.9, 7.2, 5.6, 5.6…
## $ age_under_18                              <dbl> 22.8, 27.3, 20.9, 26.4, 19.4…
## $ age_over_65                               <dbl> 16.5, 12.8, 19.1, 10.5, 12.5…
## $ female                                    <dbl> 51.5, 51.2, 51.3, 49.9, 52.3…
## $ white                                     <dbl> 69.6, 79.5, 65.3, 90.3, 94.0…
## $ black                                     <dbl> 28.3, 18.1, 28.1, 1.1, 1.6, …
## $ native                                    <dbl> 0.2, 0.3, 0.4, 0.7, 0.3, 0.3…
## $ asian                                     <dbl> 0.3, 0.2, 0.6, 2.4, 1.8, 1.8…
## $ pac_isl                                   <dbl> NA, NA, 0.1, 0.2, 0.1, 0.1, …
## $ two_plus_races                            <dbl> 1.1, 1.3, 1.6, 2.8, 1.7, 1.7…
## $ hispanic                                  <dbl> 1.0, 1.7, 8.6, 7.1, 2.0, 2.0…
## $ white_not_hispanic                        <dbl> 69.1, 78.6, 61.1, 86.5, 92.8…
## $ no_move_in_one_plus_year                  <dbl> 88.9, 87.9, 91.5, 79.2, 72.9…
## $ foreign_born                              <dbl> 1.4, 0.7, 6.0, 5.8, 2.5, 2.5…
## $ foreign_spoken_at_home                    <dbl> 2.9, 16.4, 8.7, 9.0, 3.9, 3.…
## $ hs_grad                                   <dbl> 76.8, 69.5, 78.9, 92.9, 88.1…
## $ bachelors                                 <dbl> 15.3, 10.9, 18.0, 35.0, 25.5…
## $ veterans                                  <dbl> 2151, 3495, 3378, 32352, 172…
## $ mean_work_travel                          <dbl> 25.1, 28.9, 20.0, 19.3, 13.9…
## $ housing_units                             <dbl> 12079, 25387, 21002, 159471,…
## $ home_ownership                            <dbl> 77.4, 69.8, 74.1, 69.6, 61.1…
## $ housing_multi_unit                        <dbl> 7.7, 7.1, 5.6, 18.0, 24.3, 2…
## $ median_val_owner_occupied                 <dbl> 85900, 86700, 149800, 214500…
## $ households                                <dbl> 9875, 21984, 14085, 145584, …
## $ persons_per_household                     <dbl> 2.49, 2.75, 2.39, 2.55, 2.33…
## $ per_capita_income                         <dbl> 16653, 18116, 22766, 27915, …
## $ median_household_income                   <dbl> 33143, 37261, 41372, 55835, …
## $ poverty                                   <dbl> 20.7, 20.1, 15.6, 10.2, 26.0…
## $ private_nonfarm_establishments            <dbl> 356, 1114, 818, 12394, 700, …
## $ private_nonfarm_employment                <dbl> 4713, 12485, 9424, 166239, 8…
## $ percent_change_private_nonfarm_employment <dbl> -29.4, 4.9, 8.6, 8.5, -5.4, …
## $ nonemployment_establishments              <dbl> 1494, 3681, 2267, 29533, 146…
## $ firms                                     <dbl> 1385, 4289, 2944, 42344, 206…
## $ black_owned_firms                         <dbl> 19.1, NA, 6.0, 0.4, NA, NA, …
## $ native_owned_firms                        <dbl> NA, NA, NA, 1.0, NA, NA, NA,…
## $ asian_owned_firms                         <dbl> NA, NA, NA, 1.3, 1.4, 1.4, 1…
## $ pac_isl_owned_firms                       <dbl> NA, NA, NA, NA, NA, NA, NA, …
## $ hispanic_owned_firms                      <dbl> NA, 1.4, NA, 2.1, NA, NA, NA…
## $ women_owned_firms                         <dbl> 33.4, 25.4, 23.1, 25.4, 28.2…
## $ manufacturer_shipments_2007               <dbl> 657498, NA, 526157, 4942388,…
## $ mercent_whole_sales_2007                  <dbl> NA, NA, 59400, 6006918, 5474…
## $ sales                                     <dbl> 71936, 525956, 298001, 58551…
## $ sales_per_capita                          <dbl> 2841, 8808, 7749, 15720, 125…
## $ accommodation_food_service                <dbl> 10963, 40790, 48144, 795953,…
## $ building_permits                          <dbl> 19, 108, 89, 1285, 17, 17, 1…
## $ fed_spending                              <dbl> 169972, 459879, 449275, 3122…
## $ area                                      <dbl> 490.48, 655.12, 449.50, 1052…
## $ density                                   <dbl> 51.8, 94.3, 73.8, 372.8, 45.…
## $ smoking_ban                               <chr> "none", "partial", "none", "…

# Select only relevant columns for the analysis
meow2.0 <- meow %>%
  select(name, state, smoking_ban, sales_per_capita, density, pop2010,
         age_under_18, age_over_65, median_household_income, poverty,
         bachelors, white, black, native, asian, hispanic, hs_grad)

# Clean column names: lowercase and replace spaces with underscores
names(meow2.0) <- tolower(names(meow2.0))
names(meow2.0) <- gsub(" ", "_", names(meow2.0))

sum(is.na(meow2.0))

## [1] 186

meow3.0 <- drop_na(meow2.0)

sum(is.na(meow3.0))

## [1] 0

dim(meow3.0)

## [1] 5816   17
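The counts above show 186 missing cells, while the row count falls from 5,999 to 5,816 (183 rows), because a single row can hold more than one missing value. A minimal base R sketch of how to see where the missing values sit before dropping rows is shown here. The `toy` data frame and its values are made up for illustration and are not taken from the dataset.

```r
# Toy data frame standing in for meow2.0 (values are invented for illustration)
toy <- data.frame(
  name    = c("A County", "B County", "C County", "D County"),
  poverty = c(20.7, NA, 15.6, 10.2),
  density = c(51.8, 94.3, NA, NA)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Number of rows that contain at least one missing value,
# which is how many rows drop_na() or na.omit() would remove
rows_dropped <- sum(!complete.cases(toy))
print(rows_dropped)
```

Running `colSums(is.na(meow2.0))` on the real data would show which columns account for the missing cells.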

Initial decode of data:

How does sales_per_capita vary across smoking_ban types? Is there a connection between smoking_ban and median_household_income or poverty rates?

# Average key economic, demographic, and racial metrics for each county (state, name) and smoking ban type (https://dplyr.tidyverse.org/reference/summarise.html)
meow3.0mean <- meow3.0 %>%
  group_by(state,name,smoking_ban) %>%
  summarize(
    sales_per_capita = mean(sales_per_capita, na.rm = TRUE),
    density = mean(density, na.rm = TRUE),
    pop2010 = mean(pop2010, na.rm = TRUE),
    age_under_18 = mean(age_under_18, na.rm = TRUE),
    age_over_65 = mean(age_over_65, na.rm = TRUE),
    median_household_income = mean(median_household_income, na.rm = TRUE),
    poverty = mean(poverty, na.rm = TRUE),
    bachelors = mean(bachelors, na.rm = TRUE),
    white = mean(white, na.rm = TRUE),
    black = mean(black, na.rm = TRUE),
    native = mean(native, na.rm = TRUE),
    asian = mean(asian, na.rm = TRUE),
    hispanic = mean(hispanic, na.rm = TRUE),
    hs_grad = mean(hs_grad, na.rm = TRUE)
  ) %>%
  ungroup()  # remove the grouping structure

## `summarise()` has grouped output by 'state', 'name'. You can override using the
## `.groups` argument.
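To address the question of how metrics vary across smoking ban types directly, the data can also be averaged by `smoking_ban` alone, giving one row per ban type. The following is a minimal base R sketch under that assumption. The `toy` data frame and all of its values are invented for illustration and say nothing about the real counties.

```r
# Toy rows standing in for meow3.0 (all values are invented for illustration)
toy <- data.frame(
  smoking_ban = c("none", "none", "partial", "partial", "comprehensive"),
  sales_per_capita = c(9000, 11000, 12000, 8000, 10000),
  median_household_income = c(40000, 44000, 50000, 46000, 52000),
  poverty = c(18, 16, 14, 12, 10)
)

# One row per ban type, holding the mean of each metric
by_ban <- aggregate(
  cbind(sales_per_capita, median_household_income, poverty) ~ smoking_ban,
  data = toy,
  FUN = mean
)
print(by_ban)
```

With dplyr, the same idea would be `group_by(smoking_ban)` followed by `summarize()`.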

First Visualization:

ggplot(meow3.0mean,
       aes(x = median_household_income, y = sales_per_capita, color = smoking_ban)) +
  geom_point(size = 2, alpha = 0.7) +  # Adjust point size and transparency
  # linewidth replaces the size argument for lines (size was deprecated in ggplot2 3.4.0)
  geom_smooth(method = "lm", se = FALSE, aes(linetype = smoking_ban), linewidth = 1) +
  scale_color_manual(values = c("comprehensive" = "black",
                                "none" = "purple",
                                "partial" = "orange")) +
  labs(
    title = "Median Household Income and average of smoking-related products sold per person",
    subtitle = "Comparison by Smoking Ban Type",
    x = "Median Household Income (USD)",
    y = "Avg Smoking products per person (USD)",
    color = "Smoking Ban Type"
  ) +
  theme_classic() +
  theme(
    plot.title = element_text(size = 11, face = "bold", hjust = 0, margin = margin(t = 10, b = 10)),
    plot.subtitle = element_text(size = 10, hjust = 0)
  )

## `geom_smooth()` using formula = 'y ~ x'

Data Source: County-Level Smoking Ban Dataset.

What It Represents:

This scatterplot visualizes the relationship between median household income (x-axis) and smoking-related product sales per capita (y-axis) across counties. Points are color-coded based on the type of smoking ban (comprehensive, partial, none), and regression lines are added for each ban type to highlight trends.

What we learn from this visualization:

Counties with higher median household income generally tend to have lower smoking-related product sales. However, this trend varies slightly depending on the smoking ban type.
Counties with no bans seem to have higher sales per capita, with some outliers showing unexpectedly high sales despite lower income levels.
Some counties with partial bans have notably high sales per capita regardless of their median income. These outliers may warrant further investigation into additional contributing factors, such as education levels or population density.

Cleaning 2.0:

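# Keep only counties with partial bans and bin them by bachelor's degree percentage
# (https://ssc.wisc.edu/sscc/pubs/R_intro/book/3-6-creatingchanging-variables.html)
meowislclean <- meow3.0 %>%
  filter(smoking_ban == "partial") %>%
  mutate(education_group = cut(
    bachelors,
    breaks = c(0, 20, 40, 60, 100),
    labels = c("Low", "Mid", "High", "Very High"),
    include.lowest = TRUE  # keep a value of exactly 0 in "Low" instead of NA
  ))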
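To make the grouping rule concrete, here is a small base R sketch of how `cut()` assigns labels at the break points. The percentages are made up to hit the edge cases. The example passes `include.lowest = TRUE` so that a value of exactly 0 lands in "Low"; without it, `cut()` returns `NA` for that value because its intervals are closed on the right by default.

```r
# Made-up bachelor's degree percentages, including values on the break points
pct <- c(0, 15.3, 20, 20.1, 35, 60, 72.5)

groups <- cut(
  pct,
  breaks = c(0, 20, 40, 60, 100),
  labels = c("Low", "Mid", "High", "Very High"),
  include.lowest = TRUE  # without this, a value of exactly 0 becomes NA
)

# Show each value next to its assigned group, then the group counts
print(data.frame(pct, groups))
print(table(groups))
```

A value of exactly 20 falls in "Low" and exactly 60 falls in "High", since each interval includes its upper bound.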

Second Visualization:

library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

library(scales)

## 
## Attaching package: 'scales'

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor

# bubble chart
bubble_chart <- ggplot(meowislclean, aes(x = median_household_income, y = bachelors, size = sales_per_capita, color = education_group)) +
  geom_point(alpha = 0.5) +
  labs(
    title = "Smoking Sales vs. Education Levels and Income",
    x = "Median Household Income (USD)",
    y = "Bachelor's Degree (%)",
    size = "Sales Per Capita",
    color = "Education Group"
  ) +
  scale_x_continuous(labels = comma) +  # Format x-axis with commas (https://stackoverflow.com/questions/37713351/formatting-ggplot2-axis-labels-with-commas-and-k-mm-if-i-already-have-a-y-sc)
  theme_minimal()

ggplotly(bubble_chart)

Data Source: County-Level Smoking Ban Dataset.

What It Represents:

The “Smoking Sales vs. Education Levels and Income” bubble chart shows the relationship between median household income, education levels (bachelor’s degree attainment), and smoking-related sales per capita for counties with partial smoking bans.

Key Features:
1. X-Axis: Represents median household income (in USD) at the county level, indicating the economic status of the county.
2. Y-Axis: Represents the percentage of individuals in the county with a bachelor’s degree, which serves as a proxy for education levels.
3. Bubble Size: Corresponds to sales per capita for smoking-related products, indicating the magnitude of smoking-related sales in the county. Larger bubbles mean higher sales.
4. Bubble Color: Indicates the education group (Low, Mid, High, Very High), providing an additional categorical layer to the visualization.

What we learn from this visualization:

Counties with higher median household income tend to have a higher percentage of residents with a bachelor’s degree (as seen with bubbles in the top-right quadrant).
This suggests a positive correlation between income and education levels, reinforcing the relationship between economic prosperity and educational attainment.

Statistical Analysis:

The goal of my analysis is to identify and understand how various factors (median household income, poverty, education level as the bachelor’s degree percentage, and population density) relate to sales per capita for smoking-related products. The analysis quantifies these relationships and tests their significance using a multiple linear regression model.

Model Equation:

The regression model is defined as follows:

Sales_Per_Capita = β0 + β1(Median_Household_Income) + β2(Poverty) + β3(Bachelors) + β4(Density) + ε

Meaning:

β0: Intercept
β1, β2, β3, β4: Coefficients representing the relationship between sales per capita and each predictor variable.
ε: Error term (the part of sales per capita that the predictors do not explain).

# Multiple Linear Regression Model
model <- lm(sales_per_capita ~ median_household_income + poverty + bachelors + density, data = meow3.0)
summary(model)

## 
## Call:
## lm(formula = sales_per_capita ~ median_household_income + poverty + 
##     bachelors + density, data = meow3.0)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12859  -2668   -411   2203  62399 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              8.239e+03  5.521e+02  14.921  < 2e-16 ***
## median_household_income -1.606e-02  9.944e-03  -1.615   0.1064    
## poverty                 -9.487e+01  1.450e+01  -6.541 6.63e-11 ***
## bachelors                2.143e+02  9.189e+00  23.327  < 2e-16 ***
## density                  2.228e-01  8.670e-02   2.569   0.0102 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4084 on 5811 degrees of freedom
## Multiple R-squared:  0.1799, Adjusted R-squared:  0.1793 
## F-statistic: 318.7 on 4 and 5811 DF,  p-value: < 2.2e-16

Explanation of Adjusted R-Squared and P-Value:

Adjusted R-Squared:

The Adjusted R-Squared value is 0.1793, meaning that 17.93% of the variability in sales_per_capita (smoking-related product sales) is explained by the model.

Interpretation:

This is a relatively low value, indicating that the model does not capture much of the variability in sales_per_capita.
The low adjusted R-squared suggests that there are likely other important factors influencing sales_per_capita that were not included in this analysis (e.g., cultural factors, access to smoking products, local policies, or healthcare awareness).
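As a check on the reported value, the adjusted R-squared can be recomputed from the multiple R-squared with the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1). The short sketch below uses only the numbers printed in the model summary (R² = 0.1799, n = 5,816 rows, p = 4 predictors). The variable names are chosen here for illustration.

```r
# Values reported in the summary(model) output
r_squared <- 0.1799   # Multiple R-squared
n <- 5816             # rows in meow3.0
p <- 4                # predictors in the model

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
print(round(adj_r_squared, 4))  # 0.1793
```

With so many rows relative to predictors, the adjustment is tiny, which is why the two values are nearly identical.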

Overall Model P-Value:

The model’s overall p-value is < 2.2e-16, indicating that the model as a whole is statistically significant.

Explanation of Coefficients:

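Median Household Income:

Coefficient: -0.01606 (p = 0.1064). The estimate is negative but not statistically significant at the 0.05 level, so the model gives no clear evidence of an income effect once poverty, education, and density are held constant.

Poverty:

Coefficient: -94.87 (p = 6.63e-11). Each additional percentage point of poverty is associated with about $94.87 lower sales per capita, holding the other predictors constant.

Bachelors (Education Level):

Coefficient: 214.3 (p < 2e-16). Each additional percentage point of residents with a bachelor’s degree is associated with about $214.30 higher sales per capita, holding the other predictors constant.

Density (Population Density):

Coefficient: 0.2228 (p = 0.0102). Each additional person per square mile is associated with about $0.22 higher sales per capita, a small effect that is statistically significant at the 0.05 level.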
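To show how the coefficients combine into a prediction, the sketch below plugs them into the model equation for one county. The coefficients are copied from the regression output; the county's values (income of 50,000 USD, 15% poverty, 25% bachelor’s degrees, 100 people per square mile) are hypothetical and chosen only for illustration.

```r
# Coefficients copied from the summary(model) output
coefs <- c(
  intercept = 8239,
  median_household_income = -0.01606,
  poverty = -94.87,
  bachelors = 214.3,
  density = 0.2228
)

# Hypothetical county (these values are invented for illustration)
county <- c(
  median_household_income = 50000,
  poverty = 15,
  bachelors = 25,
  density = 100
)

# Intercept plus each coefficient times the matching county value
predicted_sales <- coefs[["intercept"]] + sum(coefs[names(county)] * county)
print(predicted_sales)  # 11392.73
```

With the fitted model object available, `predict(model, newdata = ...)` would give the same kind of estimate without copying coefficients by hand.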

library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

# Pairwise plots of the four predictor variables used in the model
ggpairs(meow3.0, columns = c("median_household_income", "poverty", "bachelors", "density"))

par(mfrow = c(2, 2))  # Create a 2x2 grid for diagnostic plots
plot(model)

Final Thoughts:

Main Takeaways of my Project:

Income and Smoking Sales: In the scatterplot, higher median household income is generally associated with lower smoking-related product sales, but in the regression the income coefficient is not statistically significant (p = 0.1064) once poverty, education, and density are included.
Impact of Smoking Bans: Comprehensive bans are linked to lower smoking sales, while counties with no bans show higher sales. Partial bans show mixed results, with some counties having unexpectedly high sales.
Education’s Role: Contrary to the expectation that education lowers smoking-related sales, the regression shows that counties with higher percentages of bachelor’s degree holders tend to have higher sales per capita (coefficient 214.3, p < 2e-16).
Limitations of the Model: While poverty, education, and density were significant predictors, the low adjusted R-squared value (0.1793) indicates that many influential factors were not captured in the analysis.

Conclusion:

In this project I highlight the complex relationships between smoking bans, socioeconomic factors, and smoking-related product sales. Smoking bans appear to be associated with lower sales in many counties, although the data reveal disparities tied to income, education, and other unexplored factors. Addressing these disparities through targeted public health policies and education initiatives could improve the effectiveness of smoking bans and promote healthier communities. I’d like to thank Professor Saidi for being an awesome guide in learning R. This class has deepened my appreciation for how data analysis can inform us and decode the information in data.

Data110FINALE

ronaldo Hernandez

12-14-2024
