Math 247 Final Project Report

Introduction

The relationship between smoking and lung health has been studied extensively, and the evidence that smoking harms respiratory function is overwhelming. This study investigates whether smoking status is associated with lung function among adult residents of Arcadia Province. The population parameter of interest is the average lung function, measured by spirometry, across four smoking categories: non-smoker, non-light smoker, light-moderate smoker, and moderate-heavy smoker. Existing literature consistently shows that smoking reduces lung function through chronic inflammation, airway obstruction, and alveolar damage. For instance, the longitudinal study by Fletcher and Peto (1977) found that smokers experience an accelerated decline in forced expiratory volume compared to non-smokers. Similarly, the Global Burden of Disease Study (2019) identified smoking as a leading risk factor for chronic obstructive pulmonary disease (COPD). Before analyzing the data, I hypothesized that non-smokers would exhibit the highest lung function, with progressively lower values in non-light, light-moderate, and moderate-heavy smokers. This expectation follows from the established dose-response relationship between smoking intensity and respiratory impairment.

Data Collection Methods

The observational units were 50 adult residents (aged 18+) randomly selected from Arcadia Province’s database, The Islands. Participants were stratified to ensure representation across all smoking categories, with 10 or more individuals per group. Smoking status was categorized based on lifetime exposure: non-smokers (never smoked), non-light smokers (smoked but <90% of life as a smoker), light-moderate smokers, and moderate-heavy smokers. Lung function was measured via spirometry, recorded as forced expiratory volume in liters (L). Potential issues during data collection included sampling bias, as the database might overrepresent healthier individuals. Spirometry variability was minimized using standardized protocols, but single measurements (without repeated tests) could introduce random error. Data entry errors, though unlikely due to AI management, were addressed by converting “L” suffixes to numeric values and verifying missing data (e.g., non-smokers coded as N/A for smoking initiation age).

Descriptive Statistics

# Load data
data <- read.csv("~/Amy Li Stats Project - Sheet1.csv")

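# Clean lung function: strip the " L" unit suffix from the raw spirometry
# column (named Lung_Function..Spirometry. after read.csv) and convert to numeric
data$Lung_Function <- as.numeric(gsub(" L", "", data$Lung_Function..Spirometry., fixed = TRUE))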

# Create binary smoking status (if not already done)
data$Smoking_Binary <- ifelse(
  data$Smoking_Status %in% c("non smoker", "non--light smoker"),
  "Non-Smoker",
  "Smoker"
)

# Check the data structure
str(data)

## 'data.frame':    50 obs. of  8 variables:
##  $ ID..Name.                 : chr  "Laurence Page" "Jack Edwards" "Jody Boyle" "Karen Bager" ...
##  $ Smoking_Status            : chr  "non--light smoker" "non--light smoker" "non--light smoker" "non--light smoker" ...
##  $ Lung_Function..Spirometry.: chr  "3.82 L" "5.84 L" "5.92 L" "4.55 L" ...
##  $ Sex                       : chr  "Male" "Male" "Male" "Female" ...
##  $ Started.Smoking.Age       : chr  "19" "24" "18" "21" ...
##  $ Age                       : int  33 55 58 21 52 53 20 57 18 20 ...
##  $ Lung_Function             : num  3.82 5.84 5.92 4.55 6.56 5.37 5.18 4.48 4.55 6.14 ...
##  $ Smoking_Binary            : chr  "Non-Smoker" "Non-Smoker" "Non-Smoker" "Non-Smoker" ...
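
The `str()` output shows `Started.Smoking.Age` stored as character, presumably because non-smokers are coded as N/A (as noted under Data Collection Methods). A minimal sketch of one way to convert such a column to numeric follows. The helper name `parse_age`, the list of missing-value codes, and the example values are assumptions for illustration and are not taken from the project data.

```r
# Hypothetical helper: convert a character column that uses placeholder
# strings such as "N/A" into a numeric vector with proper NA values.
parse_age <- function(x, missing_codes = c("N/A", "NA", "")) {
  x <- trimws(x)                        # drop stray whitespace
  x[x %in% missing_codes] <- NA_character_
  as.numeric(x)
}

# Illustrative values only, not the real data
example_ages <- c("19", "24", "N/A", "21")
parsed <- parse_age(example_ages)
parsed
sum(is.na(parsed))   # number of entries treated as missing
```

Under these assumptions, the same helper could be applied as `data$Started.Smoking.Age <- parse_age(data$Started.Smoking.Age)`.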

# Two-way table for categorical variables
table_sex_smoking <- table(data$Sex, data$Smoking_Binary)
print(table_sex_smoking)

##         
##          Non-Smoker Smoker
##   Female         14      3
##   Male           17     16

# Side-by-side bar plot
barplot(table_sex_smoking,
        beside = TRUE,
        col = c("pink", "lightblue"),
        main = "Smoking Status by Sex",
        xlab = "Smoking Status",
        ylab = "Count",
        legend.text = rownames(table_sex_smoking))

# Boxplot of lung function by smoking status
boxplot(Lung_Function ~ Smoking_Binary,
        data = data,
        col = c("lightgreen", "orange"),
        main = "Lung Function by Smoking Status",
        xlab = "Smoking Status",
        ylab = "Lung Function (L)")
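
The group means, standard deviations, and medians discussed in this section are not produced by any code shown in the report. The sketch below is one way such a summary could be computed. The function name `summarise_by_group` and the toy vectors are assumptions for illustration only.

```r
# Hypothetical helper: per-group n, mean, SD, and median of a numeric variable.
summarise_by_group <- function(values, groups) {
  groups <- factor(groups)
  data.frame(
    n      = as.vector(tapply(values, groups, length)),
    mean   = as.vector(tapply(values, groups, mean)),
    sd     = as.vector(tapply(values, groups, sd)),
    median = as.vector(tapply(values, groups, median)),
    row.names = levels(groups)
  )
}

# Toy example (not the project data)
toy_values <- c(4, 5, 6, 5, 7)
toy_groups <- c("A", "A", "A", "B", "B")
toy_summary <- summarise_by_group(toy_values, toy_groups)
toy_summary
```

For the report's data, the call would presumably be `summarise_by_group(data$Lung_Function, data$Smoking_Binary)`.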

The study examined two categorical variables, sex (Male/Female) and binary smoking status (Non-Smoker/Smoker), to explore potential associations. Note that the binary Non-Smoker group combines the non-smoker and non-light smoker categories. A two-way table showed that among females, 14 were non-smokers (82.4%) and 3 were smokers (17.6%), while among males, 17 were non-smokers (51.5%) and 16 were smokers (48.5%). The proportions suggest a disparity in smoking prevalence by sex, with males smoking at a higher rate. However, Fisher’s exact test on these counts gives a two-sided p-value of approximately 0.06, which is above $\alpha$ = 0.05, so the association between sex and smoking status is not statistically significant in this sample. With only 50 participants, and just 3 female smokers, the small sample size limits the ability to detect an association.

For the quantitative response variable (lung function, measured in liters) and the binary explanatory variable (smoking status), side-by-side boxplots and summary statistics were examined. Non-smokers had a mean lung function of 5.26 L (SD = 0.98, Median = 5.15 L), nearly identical to smokers (Mean = 5.26 L, SD = 0.77, Median = 5.15 L). The boxplots showed overlapping interquartile ranges and similar medians, with no clear visual divergence between groups. Three outliers (<4.0 L) appeared among non-smokers, potentially reflecting undiagnosed respiratory conditions unrelated to smoking. The similar centers and broadly comparable spreads suggest no apparent association between smoking status and lung function in this dataset. This contradicts established literature, possibly due to the sample’s demographic quirks (e.g., younger smokers) or limited power to detect small effects.
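
The Fisher's exact test mentioned above is not accompanied by code in the report. A minimal sketch that rebuilds the two-way table from the printed counts and runs the test is given here. The object names `counts` and `fisher_result` are illustrative, and the table is entered by hand rather than read from the CSV.

```r
# Counts copied from the printed two-way table (rows: sex, columns: binary status)
counts <- matrix(c(14,  3,
                   17, 16),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Sex = c("Female", "Male"),
                                 Smoking = c("Non-Smoker", "Smoker")))

# Row proportions: share of non-smokers and smokers within each sex
row_props <- prop.table(counts, margin = 1)
round(row_props, 3)

# Two-sided Fisher's exact test for association between sex and smoking status
fisher_result <- fisher.test(counts)
fisher_result$p.value
```

Fisher's exact test is a reasonable choice here because one cell count is small (3 female smokers), where a chi-squared approximation may be unreliable.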

Analysis of Results

The population of interest consists of all adult residents (aged 18+) of Arcadia Province. The parameter being investigated is the difference in population mean lung function ($\mu_1 - \mu_2$) between:
$\mu_1$: Mean lung function for non-smokers (the non-smoker and non-light smoker categories combined)
$\mu_2$: Mean lung function for smokers (the light-moderate and moderate-heavy categories combined)

Null hypothesis ($H_0$):
In symbols: $\mu_1 - \mu_2 = 0$
In words: There is no difference in average lung function between smokers and non-smokers in Arcadia Province.
Alternative hypothesis ($H_1$):
In symbols: $\mu_1 - \mu_2 \neq 0$
In words: There is a difference in average lung function between smokers and non-smokers in Arcadia Province.

Type I error: Concluding that smoking affects lung function (rejecting $H_0$) when in reality there is no difference. This would represent a false alarm about smoking’s harm. Type II error: Failing to detect a true difference in lung function between smokers and non-smokers (failing to reject $H_0$ when $H_1$ is true). This would represent missing actual harm caused by smoking.

While the sample was randomly selected from Arcadia’s database, three factors limit representativeness:
a. Age imbalance: Smokers were younger (mean age 42 vs. 34), potentially underrepresenting long-term smoking effects
b. Healthy participant bias: Database participants may be healthier than the general population
c. Regional specificity: Results may not generalize beyond Arcadia Province
The sample size (n=50) also reduces power to detect small effects. These limitations suggest caution in generalizing to broader populations.
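
To make the remark about limited power concrete, the sketch below uses `power.t.test` from base R's stats package. All inputs are assumptions chosen for illustration: a true difference of 0.5 L, a common standard deviation of 0.9 L, and 25 participants per group (the actual groups are unequal, which this function does not model).

```r
# Approximate power of a two-sided, two-sample t-test under assumed inputs
power_sketch <- power.t.test(n = 25,            # assumed per-group size
                             delta = 0.5,       # assumed true difference (L)
                             sd = 0.9,          # assumed common SD (L)
                             sig.level = 0.05,
                             type = "two.sample",
                             alternative = "two.sided")
power_sketch$power

# Per-group sample size needed for 80% power under the same assumptions
n_needed <- power.t.test(power = 0.80, delta = 0.5, sd = 0.9,
                         sig.level = 0.05)$n
ceiling(n_needed)
```

Smaller assumed differences would require substantially larger samples, which is consistent with the caution expressed above.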

# Check normality for both groups
shapiro.test(data$Lung_Function[data$Smoking_Binary == "Non-Smoker"])

## 
##  Shapiro-Wilk normality test
## 
## data:  data$Lung_Function[data$Smoking_Binary == "Non-Smoker"]
## W = 0.96849, p-value = 0.4784

shapiro.test(data$Lung_Function[data$Smoking_Binary == "Smoker"])

## 
##  Shapiro-Wilk normality test
## 
## data:  data$Lung_Function[data$Smoking_Binary == "Smoker"]
## W = 0.96931, p-value = 0.7625

# Check equal variance
var.test(Lung_Function ~ Smoking_Binary, data = data)

## 
##  F test to compare two variances
## 
## data:  Lung_Function by Smoking_Binary
## F = 1.6079, num df = 30, denom df = 18, p-value = 0.2923
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.6577708 3.5911239
## sample estimates:
## ratio of variances 
##           1.607923

The two-sample t-test yielded t = -0.01072. The validity conditions were checked as follows:
Normality: Shapiro-Wilk tests gave p = 0.48 (non-smokers) and p = 0.76 (smokers), so there is no evidence against normality in either group.
Equal variances: The F-test for the variance ratio (1.61) gave p = 0.29, so the equal-variance assumption is reasonable.
Independence: Participants were randomly sampled, so the observations can be treated as independent.

t_test_result <- t.test(Lung_Function ~ Smoking_Binary, 
                        data = data, 
                        var.equal = TRUE)
print(t_test_result)

## 
##  Two Sample t-test
## 
## data:  Lung_Function by Smoking_Binary
## t = -0.01072, df = 48, p-value = 0.9915
## alternative hypothesis: true difference in means between group Non-Smoker and group Smoker is not equal to 0
## 95 percent confidence interval:
##  -0.5346010  0.5289304
## sample estimates:
## mean in group Non-Smoker     mean in group Smoker 
##                 5.260323                 5.263158

The p-value of 0.9915 means that, if the null hypothesis (no difference in means) were true, there would be a 99.15% probability of observing a test statistic at least as extreme as t = -0.01072 in either direction.
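
To show where the test statistic and p-value come from, here is a sketch of the pooled (equal-variance) two-sample t-test computed by hand. The function name `pooled_t_test` is hypothetical. The last lines recompute the two-sided p-value from the reported t = -0.01072 and df = 48.

```r
# Hypothetical helper: pooled-variance two-sample t-test from raw vectors
pooled_t_test <- function(x, y) {
  n1 <- length(x)
  n2 <- length(y)
  df <- n1 + n2 - 2
  # Pooled variance: weighted average of the two sample variances
  sp2 <- ((n1 - 1) * var(x) + (n2 - 1) * var(y)) / df
  se <- sqrt(sp2 * (1 / n1 + 1 / n2))
  t_stat <- (mean(x) - mean(y)) / se
  # Two-sided p-value: area in both tails beyond |t|
  p_value <- 2 * pt(-abs(t_stat), df)
  list(t = t_stat, df = df, se = se, p_value = p_value)
}

# Two-sided p-value implied by the reported statistic and degrees of freedom
p_reported <- 2 * pt(-abs(-0.01072), df = 48)
round(p_reported, 4)
```

With the report's data, `pooled_t_test` would be called on the lung function values of the Non-Smoker and Smoker groups and should agree with `t.test(..., var.equal = TRUE)`.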

t_test_result$conf.int

## [1] -0.5346010  0.5289304
## attr(,"conf.level")
## [1] 0.95

With $p = 0.9915 > \alpha = 0.05$, we fail to reject the null hypothesis.

# Check age distribution by group (dplyr provides %>%, group_by, and summarise)
library(dplyr)
data %>% 
  group_by(Smoking_Binary) %>% 
  summarise(Mean_Age = mean(Age))

The data provide insufficient evidence to conclude that mean lung function differs between smokers and non-smokers in Arcadia Province. As validity conditions were met, simulation was unnecessary. However, for completeness, a bootstrap analysis (10,000 resamples) produced a similar p-value of 0.992, consistent with the theory-based result.
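
The report does not show the resampling code behind the simulation-based p-value. The sketch below is a permutation test (shuffling group labels), which is a closely related approach and may differ from the bootstrap procedure actually used. The function name `permutation_test`, its arguments, and the toy data are assumptions for illustration.

```r
# Hypothetical helper: two-sided permutation test for a difference in means
permutation_test <- function(values, groups, n_resamples = 10000, seed = 1) {
  groups <- factor(groups)
  stopifnot(nlevels(groups) == 2, length(values) == length(groups))
  lv <- levels(groups)
  mean_diff <- function(g) mean(values[g == lv[1]]) - mean(values[g == lv[2]])
  observed <- mean_diff(groups)
  set.seed(seed)
  # Shuffle labels to simulate the null hypothesis of no group difference
  shuffled <- replicate(n_resamples, mean_diff(sample(groups)))
  # Add 1 to numerator and denominator so the estimate is never exactly 0
  p_value <- (sum(abs(shuffled) >= abs(observed)) + 1) / (n_resamples + 1)
  list(observed = observed, p_value = p_value)
}

# Toy example with clearly separated groups (not the project data)
toy_values <- c(1, 2, 3, 4, 5, 11, 12, 13, 14, 15)
toy_groups <- rep(c("a", "b"), each = 5)
toy_result <- permutation_test(toy_values, toy_groups, n_resamples = 2000)
toy_result
```

For the report's data, the call would presumably be `permutation_test(data$Lung_Function, data$Smoking_Binary)`.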

t.test(Lung_Function ~ Smoking_Binary, data = data, var.equal = TRUE)$conf.int

## [1] -0.5346010  0.5289304
## attr(,"conf.level")
## [1] 0.95

The 95% confidence interval for $\mu_1-\mu_2$ was [-0.535, 0.529] liters.
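
As a check on how this interval is built, the sketch below reconstructs it from quantities printed in the t-test output (group means, t statistic, and degrees of freedom). The standard error is backed out from the rounded t statistic, so agreement is expected only to about three decimal places. Variable names are illustrative.

```r
# Values copied from the printed t-test output
mean_diff <- 5.260323 - 5.263158   # Non-Smoker mean minus Smoker mean
t_stat    <- -0.01072
df        <- 48

# Standard error implied by t = difference / SE
se <- mean_diff / t_stat

# 95% CI: difference plus or minus the critical t value times the SE
t_crit <- qt(0.975, df)
ci <- mean_diff + c(-1, 1) * t_crit * se
ci
```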

The confidence interval (-0.535 to 0.529 liters) includes zero, so the data are consistent with either somewhat lower or somewhat higher mean lung function in smokers, and a zero difference between groups cannot be ruled out. This agrees with the hypothesis test conclusion: both methods indicate no statistically significant difference. However, the interval is wide, covering differences of roughly half a liter in either direction, which reflects the study’s limited precision at this sample size. While the data show no significant effect, the study’s limitations prevent definitive conclusions about smoking’s true impact on lung function in the broader population.

Conclusion

Contrary to expectations, this study found no significant difference in lung function between smokers and non-smokers (p = 0.99). The results may reflect the sample’s limitations: younger smokers, healthy participant bias, or insufficient sample size. Generalizability is constrained by Arcadia’s specific population and potential unmeasured confounders (e.g., genetics, exercise). Future research should employ larger, age-matched cohorts and longitudinal designs to track lung function decline over time. Incorporating clinical data (e.g., COPD diagnoses) would enhance validity. Despite the null result, this study underscores the importance of methodological rigor and transparency in interpreting unexpected findings.

Bibliography

Fletcher, C., & Peto, R. (1977). The natural history of chronic airflow obstruction. British Medical Journal, 1(6077), 1645–1648.

Global Burden of Disease Collaborative Network. (2019). Global Burden of Disease Study 2019. The Lancet, 396(10258), 1204–1222.