Table of Contents

  1. Introduction
  2. Exploratory Data Analysis
  3. Model Investigation
  4. Appendix

Part 1. Introduction

Description of Data Set

In 1973, a large cotton textile company in North Carolina participated in a study. The purpose of the study was to investigate the prevalence of byssinosis, a form of pneumoconiosis. Data was collected on 5,419 workers. Variables recorded include:

  • Type of work place [1 (most dusty), 2 (less dusty), 3 (least dusty)]

  • Employment, years [< 10, 10-19, 20-]

  • Smoking [Smoker, or not in last 5 years]

  • Sex [Male, Female]

  • Race [White, Other]

  • Byssinosis [Yes, No]

This data set is Binomial data.

Part 2. Exploratory Data Analysis

Visualizations

Our two graphs below compare elements of the dataset:

  • Out of the 5149 workers, 165 of them have Byssinosis (about 3%). This is a very low rate.

  • There are slighly more male workers (2916) than female workers (2503) at the company.

  • Most of the workers at the company are White (68%).

The first graph shows that more than half of the workers smoke. The majority of the smokers are male and the majority of the nonSmokers are female.

The second graph contrasts length of employment and sex. More workers have worked at the company for less than 10 years, compared to 10-19 or 20+ years. Interestingly, there are more workers in the <10 and 20+ range for employment than 10-19. This may suggest high turnover in the industry. If an employee worked a the company for over 10 years they are more likely to be a male.

Odds and Odds Ratios based off the Data Set

The odds ratio of Male v. Female having Byssinosis is 2.969488. Males are 2.97x more likely to have Byssinosis than Females.

The odds ratio of Smokers v. NonSmokers having Byssinosis is 2.185246. Meaning that Smokers are 2.19x more likley to have Byssinosis.

The odds of having Byssinosis based on Workspace [Most Dusty, Less Dusty, Least Dusty] are: 0.1569507, 0.01384615, 0.01217391. Dustier workspaces lead to higher odds of Byssinosis, but there is little difference in these odds for less and least dusty workspaces.

Part 3. Model Investigation

Additive Model

Using the Wald Test and \(\alpha\) = 0.05, We drop Sex and Race from our model, as both of their p-values are below our chosen significance level of 0.05. This leaves us with the predictors Workspace, Smoking, and Employment.

Fitted Model after Wald Test

## 
## Call:
## glm(formula = cbind(Byssinosis, Non.Byssinosis) ~ . - Race - 
##     Sex - total_workers, family = binomial(), data = bys)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6239  -0.7904  -0.1946   0.2679   1.6605  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -2.4546     0.2047 -11.992  < 2e-16 ***
## Employment>=20         0.6728     0.1813   3.710 0.000207 ***
## Employment10-19        0.5060     0.2490   2.032 0.042119 *  
## SmokingYes             0.6210     0.1908   3.255 0.001133 ** 
## WorkspaceLess Dusty   -2.5493     0.2614  -9.753  < 2e-16 ***
## WorkspaceLeast Dusty  -2.7175     0.1898 -14.314  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 322.527  on 64  degrees of freedom
## Residual deviance:  43.882  on 59  degrees of freedom
## AIC: 162.56
## 
## Number of Fisher Scoring iterations: 5

We see that for each year of employment, the probability of getting the disease increases. We also see that the probability of getting the disease decreases for less dusty workspaces, so higher dust increases chances of disease. Finally, we also see that smoking increases risk for the disease. The most impactful factor on risk for this disease is employment, as only a one year increase changes the odds by 0.5 or 0.6 each time, depending on which employment category the employee is in. This is different than smoking status, that only increases risk by 0.6 if someone smokes at all.

Odds and Odds Ratios based off the Fitted Model

\(log(\frac{\pi}{1 -\pi})=b0+b1*X1+b2*X2\)

b0 = summary(Model2)$coefficients[1,1]

b1 = summary(Model2)$coefficients[2,1]

b2 = summary(Model2)$coefficients[3,1]

b3 = summary(Model2)$coefficients[4,1]

b4 = summary(Model2)$coefficients[5,1]

b5 = summary(Model2)$coefficients[6,1]

AIC and BIC

Running Best Forward AIC, Best Backward AIC, and Best BiDirectional AIC all gave the same variables for the model: Employment, Smoking and Workspace. Running best Forward BIC, Best Backward BIC, and Best BiDirectional BIC all gave the same variables as well: Employment and Smoking. Both measures AIC and BIC support dropping the variables Sex and Race, but differ on whether to drop Workspace.

Conclusion

AIC and BIC disagree on whether to remove the Workspace predictor from our model, but since our Wald test kept the predictor at significance level 0.05, we choose to keep Workspace. This leaves us with three predictors in our model: Smoking, Workspace, and Employment. Predictors Sex and Race are insignificant predictors for this regression. We conclude that in order for an employee to minimize risk of Byssinosis in this workspace, they should not smoke, work in the least dusty workspace, and quit as soon as possible. The company may be liable for damages caused by the disease as these factors for the disease are under their control. Further analysis should be conducted by the government to determine liability.

Part _. Appendix

knitr::opts_chunk$set(echo = TRUE)
knitr::opts_knit$set(root.dir = '~/Downloads/' )
library('ggplot2')
library('bestglm')
library('plyr')
library('gridExtra')
# load Byssinosis.csv into a table
bys = read.table('/Users/jon/Downloads/Byssinosis.csv', header = TRUE, sep = ",")

# Change workspace to be a factor not numeric
bys$Workspace = as.factor(bys$Workspace)
bys$Workspace = revalue(bys$Workspace, c("1" = "Most Dusty", "2" = "Less Dusty", "3" = "Least Dusty"))
# adds a new columns for total_workers
bys$total_workers = ""

# changes column type to numeric
bys$total_workers = as.integer(bys$total_workers)

# fill in total_workers column
i = 1
for (i in 1:nrow(bys))
  bys$total_workers[i] = sum(bys$Byssinosis[i]) + sum(bys$Non.Byssinosis[i])
 i = i +1
# workers total
sum(bys$Byssinosis) + sum(bys$Non.Byssinosis) 

# workers with Byssinosis
sum(bys$Byssinosis)
par(mfrow=c(1,2)) 

# Stacked barplot with multiple groups
plot1 = ggplot(data = bys, aes(x = Smoking, y = total_workers, fill = Sex)) +
  geom_bar(stat="identity") + 
  ylab('Total Number of Workers') + 
  labs(title = 'Number of Workers by
      Sex and Smoking Status', x = 'Smoking Status') + 
  # fill colors are colorblind safe and print friendly
  # http://colorbrewer2.org/#type=qualitative&scheme=Dark2&n=3
  scale_fill_manual(values=c("#1b9e77", "#d95f02"))

plot2 = ggplot(data = bys, aes(x = Employment, y = total_workers, fill = Sex)) + 
  geom_bar(stat = "identity") + 
  scale_fill_manual(values=c("#1b9e77", "#d95f02")) + 
  labs(y = 'Total Number of Workers', x = 'Length of Employment (in years)', 
       title = 'Length of Employment
       for Workers by Sex') + 
  scale_x_discrete(limits=c('<10','10-19','>=20'))

grid.arrange(plot1, plot2, ncol=2)
# of Male Workers
sum(bys$Byssinosis[bys$Sex == "M"]) + sum(bys$Non.Byssinosis[bys$Sex == "M"])

# of Female Workers
sum(bys$Byssinosis[bys$Sex == "F"]) + sum(bys$Non.Byssinosis[bys$Sex == "F"])

# Odds of Byssinossis for Smokers
smoke = sum(bys$Byssinosis[bys$Smoking == "Yes"]) / ( sum(bys$Byssinosis[bys$Smoking == "Yes"]) + sum(bys$Non.Byssinosis[bys$Smoking == "Yes"]) )

# Odds of Byssniossis for NonSmokers
noSmoke = sum(bys$Byssinosis[bys$Smoking == "No"]) / ( sum(bys$Byssinosis[bys$Smoking == "No"]) + sum(bys$Non.Byssinosis[bys$Smoking == "No"]) )

# Odds Ratio of Byssniosis for Smokers v NonSmokers
smoke/noSmoke

WhiteOdds = sum(bys$Byssinosis[bys$Race == "W"]) / ( sum(bys$Byssinosis[bys$Race == "W"]) + sum(bys$Non.Byssinosis[bys$Race == "W"]) )
OtherOdds = sum(bys$Byssinosis[bys$Race == "O"]) / ( sum(bys$Byssinosis[bys$Race == "O"]) + sum(bys$Non.Byssinosis[bys$Race == "O"]) )
RaceOR = WhiteOdds/OtherOdds

# Odds of Bysinosis by Workspace Type
sum(bys$Byssinosis[bys$Workspace == "Most Dusty"]) / (sum(bys$Byssinosis[bys$Workspace == "Most Dusty"]) + sum(bys$Non.Byssinosis[bys$Workspace == "Most Dusty"]))

sum(bys$Byssinosis[bys$Workspace == "Less Dusty"]) / (sum(bys$Byssinosis[bys$Workspace == "Less Dusty"]) + sum(bys$Non.Byssinosis[bys$Workspace == "Less Dusty"]))

sum(bys$Byssinosis[bys$Workspace == "Least Dusty"]) / (sum(bys$Byssinosis[bys$Workspace == "Least Dusty"]) + sum(bys$Non.Byssinosis[bys$Workspace == "Least Dusty"]))
# Model 1: Full Additive Model
Model1 = glm(cbind(Byssinosis,Non.Byssinosis) ~. - total_workers, data = bys,family = binomial())
# Model 2: Additive Model with Race and Sex Removed
Model2 = glm(cbind(Byssinosis,Non.Byssinosis) ~ .  -Race - Sex- total_workers, data = bys,family = binomial())
summary(Model2)
# Model 3: Null Model
Model3 = glm(cbind(Byssinosis,Non.Byssinosis) ~ 1, data = bys,family = binomial())
bestForwardAIC = step(Model3, scope = list(lower = Model3, upper = Model1),direction = "forward")

bestBackwardAIC=step(Model1, scope = list(lower = Model3, upper = Model1),direction = "backward")

bestBidirectionAIC=step(Model1, scope = list(lower = Model3, upper = Model1),direction = "both")