In 1973, a large cotton textile company in North Carolina participated in a study. The purpose of the study was to investigate the prevalence of byssinosis, a form of pneumoconiosis. Data was collected on 5,419 workers. Variables recorded include:
Type of work place [1 (most dusty), 2 (less dusty), 3 (least dusty)]
Employment, years [< 10, 10-19, 20-]
Smoking [Smoker, or not in last 5 years]
Sex [Male, Female]
Race [White, Other]
Byssinosis [Yes, No]
This data set is Binomial data.
Our two graphs below compare elements of the dataset:
Out of the 5149 workers, 165 of them have Byssinosis (about 3%). This is a very low rate.
There are slighly more male workers (2916) than female workers (2503) at the company.
Most of the workers at the company are White (68%).
The first graph shows that more than half of the workers smoke. The majority of the smokers are male and the majority of the nonSmokers are female.
The second graph contrasts length of employment and sex. More workers have worked at the company for less than 10 years, compared to 10-19 or 20+ years. Interestingly, there are more workers in the <10 and 20+ range for employment than 10-19. This may suggest high turnover in the industry. If an employee worked a the company for over 10 years they are more likely to be a male.
The odds ratio of Male v. Female having Byssinosis is 2.969488. Males are 2.97x more likely to have Byssinosis than Females.
The odds ratio of Smokers v. NonSmokers having Byssinosis is 2.185246. Meaning that Smokers are 2.19x more likley to have Byssinosis.
The odds of having Byssinosis based on Workspace [Most Dusty, Less Dusty, Least Dusty] are: 0.1569507, 0.01384615, 0.01217391. Dustier workspaces lead to higher odds of Byssinosis, but there is little difference in these odds for less and least dusty workspaces.
Using the Wald Test and \(\alpha\) = 0.05, We drop Sex and Race from our model, as both of their p-values are below our chosen significance level of 0.05. This leaves us with the predictors Workspace, Smoking, and Employment.
##
## Call:
## glm(formula = cbind(Byssinosis, Non.Byssinosis) ~ . - Race -
## Sex - total_workers, family = binomial(), data = bys)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6239 -0.7904 -0.1946 0.2679 1.6605
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.4546 0.2047 -11.992 < 2e-16 ***
## Employment>=20 0.6728 0.1813 3.710 0.000207 ***
## Employment10-19 0.5060 0.2490 2.032 0.042119 *
## SmokingYes 0.6210 0.1908 3.255 0.001133 **
## WorkspaceLess Dusty -2.5493 0.2614 -9.753 < 2e-16 ***
## WorkspaceLeast Dusty -2.7175 0.1898 -14.314 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 322.527 on 64 degrees of freedom
## Residual deviance: 43.882 on 59 degrees of freedom
## AIC: 162.56
##
## Number of Fisher Scoring iterations: 5
We see that for each year of employment, the probability of getting the disease increases. We also see that the probability of getting the disease decreases for less dusty workspaces, so higher dust increases chances of disease. Finally, we also see that smoking increases risk for the disease. The most impactful factor on risk for this disease is employment, as only a one year increase changes the odds by 0.5 or 0.6 each time, depending on which employment category the employee is in. This is different than smoking status, that only increases risk by 0.6 if someone smokes at all.
\(log(\frac{\pi}{1 -\pi})=b0+b1*X1+b2*X2\)
b0 = summary(Model2)$coefficients[1,1]
b1 = summary(Model2)$coefficients[2,1]
b2 = summary(Model2)$coefficients[3,1]
b3 = summary(Model2)$coefficients[4,1]
b4 = summary(Model2)$coefficients[5,1]
b5 = summary(Model2)$coefficients[6,1]
Running Best Forward AIC, Best Backward AIC, and Best BiDirectional AIC all gave the same variables for the model: Employment, Smoking and Workspace. Running best Forward BIC, Best Backward BIC, and Best BiDirectional BIC all gave the same variables as well: Employment and Smoking. Both measures AIC and BIC support dropping the variables Sex and Race, but differ on whether to drop Workspace.
AIC and BIC disagree on whether to remove the Workspace predictor from our model, but since our Wald test kept the predictor at significance level 0.05, we choose to keep Workspace. This leaves us with three predictors in our model: Smoking, Workspace, and Employment. Predictors Sex and Race are insignificant predictors for this regression. We conclude that in order for an employee to minimize risk of Byssinosis in this workspace, they should not smoke, work in the least dusty workspace, and quit as soon as possible. The company may be liable for damages caused by the disease as these factors for the disease are under their control. Further analysis should be conducted by the government to determine liability.
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_knit$set(root.dir = '~/Downloads/' )
library('ggplot2')
library('bestglm')
library('plyr')
library('gridExtra')
# load Byssinosis.csv into a table
bys = read.table('/Users/jon/Downloads/Byssinosis.csv', header = TRUE, sep = ",")
# Change workspace to be a factor not numeric
bys$Workspace = as.factor(bys$Workspace)
bys$Workspace = revalue(bys$Workspace, c("1" = "Most Dusty", "2" = "Less Dusty", "3" = "Least Dusty"))
# adds a new columns for total_workers
bys$total_workers = ""
# changes column type to numeric
bys$total_workers = as.integer(bys$total_workers)
# fill in total_workers column
i = 1
for (i in 1:nrow(bys))
bys$total_workers[i] = sum(bys$Byssinosis[i]) + sum(bys$Non.Byssinosis[i])
i = i +1
# workers total
sum(bys$Byssinosis) + sum(bys$Non.Byssinosis)
# workers with Byssinosis
sum(bys$Byssinosis)
par(mfrow=c(1,2))
# Stacked barplot with multiple groups
plot1 = ggplot(data = bys, aes(x = Smoking, y = total_workers, fill = Sex)) +
geom_bar(stat="identity") +
ylab('Total Number of Workers') +
labs(title = 'Number of Workers by
Sex and Smoking Status', x = 'Smoking Status') +
# fill colors are colorblind safe and print friendly
# http://colorbrewer2.org/#type=qualitative&scheme=Dark2&n=3
scale_fill_manual(values=c("#1b9e77", "#d95f02"))
plot2 = ggplot(data = bys, aes(x = Employment, y = total_workers, fill = Sex)) +
geom_bar(stat = "identity") +
scale_fill_manual(values=c("#1b9e77", "#d95f02")) +
labs(y = 'Total Number of Workers', x = 'Length of Employment (in years)',
title = 'Length of Employment
for Workers by Sex') +
scale_x_discrete(limits=c('<10','10-19','>=20'))
grid.arrange(plot1, plot2, ncol=2)
# of Male Workers
sum(bys$Byssinosis[bys$Sex == "M"]) + sum(bys$Non.Byssinosis[bys$Sex == "M"])
# of Female Workers
sum(bys$Byssinosis[bys$Sex == "F"]) + sum(bys$Non.Byssinosis[bys$Sex == "F"])
# Odds of Byssinossis for Smokers
smoke = sum(bys$Byssinosis[bys$Smoking == "Yes"]) / ( sum(bys$Byssinosis[bys$Smoking == "Yes"]) + sum(bys$Non.Byssinosis[bys$Smoking == "Yes"]) )
# Odds of Byssniossis for NonSmokers
noSmoke = sum(bys$Byssinosis[bys$Smoking == "No"]) / ( sum(bys$Byssinosis[bys$Smoking == "No"]) + sum(bys$Non.Byssinosis[bys$Smoking == "No"]) )
# Odds Ratio of Byssniosis for Smokers v NonSmokers
smoke/noSmoke
WhiteOdds = sum(bys$Byssinosis[bys$Race == "W"]) / ( sum(bys$Byssinosis[bys$Race == "W"]) + sum(bys$Non.Byssinosis[bys$Race == "W"]) )
OtherOdds = sum(bys$Byssinosis[bys$Race == "O"]) / ( sum(bys$Byssinosis[bys$Race == "O"]) + sum(bys$Non.Byssinosis[bys$Race == "O"]) )
RaceOR = WhiteOdds/OtherOdds
# Odds of Bysinosis by Workspace Type
sum(bys$Byssinosis[bys$Workspace == "Most Dusty"]) / (sum(bys$Byssinosis[bys$Workspace == "Most Dusty"]) + sum(bys$Non.Byssinosis[bys$Workspace == "Most Dusty"]))
sum(bys$Byssinosis[bys$Workspace == "Less Dusty"]) / (sum(bys$Byssinosis[bys$Workspace == "Less Dusty"]) + sum(bys$Non.Byssinosis[bys$Workspace == "Less Dusty"]))
sum(bys$Byssinosis[bys$Workspace == "Least Dusty"]) / (sum(bys$Byssinosis[bys$Workspace == "Least Dusty"]) + sum(bys$Non.Byssinosis[bys$Workspace == "Least Dusty"]))
# Model 1: Full Additive Model
Model1 = glm(cbind(Byssinosis,Non.Byssinosis) ~. - total_workers, data = bys,family = binomial())
# Model 2: Additive Model with Race and Sex Removed
Model2 = glm(cbind(Byssinosis,Non.Byssinosis) ~ . -Race - Sex- total_workers, data = bys,family = binomial())
summary(Model2)
# Model 3: Null Model
Model3 = glm(cbind(Byssinosis,Non.Byssinosis) ~ 1, data = bys,family = binomial())
bestForwardAIC = step(Model3, scope = list(lower = Model3, upper = Model1),direction = "forward")
bestBackwardAIC=step(Model1, scope = list(lower = Model3, upper = Model1),direction = "backward")
bestBidirectionAIC=step(Model1, scope = list(lower = Model3, upper = Model1),direction = "both")