http://www.sun.ac.za/english/data-science-and-computational-thinking
library(modelr)
library(comprehenr)
Odds and odds ratios express the influence of an exposure (versus not being exposed) on a binary outcome. As with risk, the notion of a positive outcome is not a value judgement; it simply indicates the outcome under consideration.
In (1) below, we note the result of a study. Some participants were exposed and some were not. This exposure might be the administration of a new drug, and the non-exposed group might then have received a placebo. The participants are then observed for a binary outcome, such as improvement versus non-improvement.
\[\begin{array}{lccc} & \text{not positive outcome} & \text{positive outcome} & \text{row totals} \\ \text{not exposed} & a & b & a+b \\ \text{exposed} & c & d & c+d \\ \text{column totals} & a+c & b+d & a+b+c+d \end{array} \tag{1}\]
The aim is then to express the odds ratio, comparing the two groups. The odds ratio is used in a similar vein to the risk ratio. It is the only ratio that can be used in case-control series. As we will see from the equations below, the odds ratio does not rely on knowledge of the population (which is approximated by the randomized participants in a trial). It is only when the outcome is very rare (low incidence) that the odds ratio approximates the risk ratio.
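The last point can be checked numerically. The sketch below uses made-up counts laid out as in (1), together with the cross-product form of the odds ratio derived later in (7). The helper names `risk_ratio` and `odds_ratio` and all counts are assumptions for illustration, not data from a study.

```r
# Cell labels follow table (1): a, b = not exposed; c, d = exposed
risk_ratio <- function(a, b, c, d) (d / (c + d)) / (b / (a + b))
odds_ratio <- function(a, b, c, d) (a * d) / (b * c)

# Hypothetical rare outcome: 1% of the unexposed, 2% of the exposed
rr_rare <- risk_ratio(a = 9900, b = 100, c = 9800, d = 200)
or_rare <- odds_ratio(a = 9900, b = 100, c = 9800, d = 200)

# Hypothetical common outcome: 40% of the unexposed, 80% of the exposed
rr_common <- risk_ratio(a = 600, b = 400, c = 200, d = 800)
or_common <- odds_ratio(a = 600, b = 400, c = 200, d = 800)

round(c(rr_rare, or_rare, rr_common, or_common), 2)
```

With the rare outcome the two ratios nearly coincide (both close to 2). With the common outcome the risk ratio is still 2, but the odds ratio is 6.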
The odds is calculated as in (2).
\[\text{odds} = \frac{p}{1 - p} \tag{2}\]
Here, \(p\) is the probability of success, where success is the outcome under investigation. If we consider the exposed participants from the table in (1), we can calculate \(p\) for the positive outcome as in (3).
\[p \left( \text{positive outcome | exposed} \right) = \frac{d}{c+d}\tag{3}\]
The probability of a non-positive outcome in the exposed group is given in (4).
\[p \left( \text{not positive outcome | exposed} \right) = \frac{c}{c+d}\tag{4}\]
Notice how this is the same as \(1 - p \left( \text{positive outcome | exposed} \right)\), shown in (5).
\[\begin{align}&1 - p \left( \text{positive outcome | exposed} \right) \\ = &1 - \frac{d}{c+d} \\ = &\frac{c+d}{c+d} - \frac{d}{c+d} \\ = &\frac{c+d-d}{c+d} \\ =&\frac{c}{c+d} \\ = &p \left( \text{not positive outcome | exposed} \right) \end{align} \tag{5}\]
This simplifies the equation for the odds, derived in (6).
\[\begin{align} \text{odds} &= \frac{\frac{d}{c+d}}{1 - \frac{d}{c+d}} \\ &=\frac{d}{c+d} \cdot \frac{c+d}{c+d-d} \\ &= \frac{d}{c} \end{align} \tag{6}\]
The odds is therefore a simple fraction. Using a sports metaphor, if your team wins eight games and loses four, the odds of winning are eight over four (analogous to \(d\) over \(c\)), which is simply two. The odds of winning against not winning is two.
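The metaphor can be checked against (2) directly. The following is a minimal sketch using the win and loss counts from the paragraph above; the variable names are our own.

```r
# Season record from the metaphor: eight wins, four losses
wins <- 8
losses <- 4

# Probability of winning, then odds via p / (1 - p) as in (2)
p_win <- wins / (wins + losses)
odds_from_p <- p_win / (1 - p_win)

# Odds as the direct ratio of wins to losses, analogous to d / c in (6)
odds_direct <- wins / losses

# Both routes should agree up to floating-point error
all.equal(odds_from_p, odds_direct)
```

Both routes give odds of two, even though the probability of winning is two thirds. Odds and probabilities are different scales for the same information.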
In (1), there are two groups, the exposed and the not exposed. We can calculate odds for both groups (with respect to a positive vs. not a positive outcome). This allows us to express an odds ratio, given in (7).
\[\begin{align} \text{odds ratio} &= \frac{\frac{d}{c}}{\frac{b}{a}} \\ &=\frac{ad}{bc} \end{align} \tag{7}\]
Being in the exposed group can then increase or decrease the odds of the positive outcome relative to being in the non-exposed group. An OR of \(1.0\) indicates no difference in odds between the groups. A value greater than \(1.0\) indicates higher odds in the exposed group and a value lower than \(1.0\) indicates lower odds.
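The step between the two lines of (7) is division by a fraction, which is the same as multiplication by its reciprocal. Written out with the intermediate term:

\[\frac{\frac{d}{c}}{\frac{b}{a}} = \frac{d}{c} \cdot \frac{a}{b} = \frac{ad}{bc}\]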
Consider a case-control series regarding the development of surgical site sepsis in lower gastrointestinal tract surgery classified as clean-contaminated surgery. The control group received a single standard, pre-operative dose of prophylactic antibiotics and the treatment group received double the standard dose.
The group of interest is the double-dose group and the positive outcome is the development of surgical site infection. The results are shown in (8).
\[\begin{array}{lccc} & \text{infection} & \text{no infection} & \text{row totals} \\ \text{double dose} & 15 & 86 & 101 \\ \text{standard dose} & 19 & 71 & 90 \\ \text{column totals} & 34 & 157 & 191 \end{array} \tag{8}\]
The odds for infection in both groups are calculated below.
odds_double_infection <- 15 / 86
odds_standard_infection <- 19 / 71
The odds for infection in the treatment group is 0.17 and the odds for infection in the control group is 0.27.
The odds ratio for the treatment group with respect to the control group is calculated next.
odds_ratio_infection <- odds_double_infection / odds_standard_infection
odds_ratio_infection
## [1] 0.6517748
The odds ratio (OR) is 0.65. We can state that the odds for infection are lower in the treatment group, since the value is less than \(1.0\). If we subtract the OR from \(1.0\) and multiply the result by \(100\)%, we get a percentage reduction in odds, which is 34.8%.
This is the result of our single experiment. We need to express uncertainty in this finding, which we do by calculating confidence intervals. There is an equation to calculate the confidence intervals, shown in (10), but we can also use resampling to calculate the confidence intervals.
The equation for calculating the confidence interval of the odds ratio requires the standard error of the natural logarithm of the odds ratio. This is shown in (9).
\[\text{standard error of } \ln\left(\text{OR}\right) = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}} \tag{9}\]
We calculate the standard error below.
standard_error_or <- sqrt(1/15 + 1/86 + 1/19 + 1/71)
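Substituting the four counts from (8) into (9) gives the value used in the bounds that follow. The rounding to three and four decimals is ours.

\[\sqrt{\frac{1}{15} + \frac{1}{86} + \frac{1}{19} + \frac{1}{71}} \approx \sqrt{0.0667 + 0.0116 + 0.0526 + 0.0141} = \sqrt{0.1450} \approx 0.381\]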
The equation for the confidence interval bounds of the odds ratio is given in (10). The \(\pm\) yields two values: subtraction gives the lower bound and addition gives the upper bound. Here \(\ln\) is the natural logarithm and SE is the standard error from (9).
\[\text{CI}_{\text{OR}} = e^{\ln\left(\text{OR}\right) \pm z \, \text{SE}} \tag{10}\]
For \(\alpha = 0.05\), the \(z\) value is approximately \(1.96\). We calculate the lower and upper bounds below, using (10).
lower_bound <- exp(log(odds_ratio_infection) - 1.96 * standard_error_or)
lower_bound
## [1] 0.3089952
upper_bound <- exp(log(odds_ratio_infection) + 1.96 * standard_error_or)
upper_bound
## [1] 1.374812
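The steps in (9) and (10) can be collected into one reusable function. This is a minimal sketch: the name `wald_ci_or` and its arguments are hypothetical, and it uses `qnorm` for the \(z\) value rather than the rounded \(1.96\), so the bounds may differ from those above in the later decimals.

```r
# Hypothetical helper: confidence interval for an odds ratio via (9) and (10).
# Cell labels follow table (1): a, b = not exposed; c, d = exposed.
wald_ci_or <- function(a, b, c, d, conf_level = 0.95) {
  or <- (a * d) / (b * c)
  se_log_or <- sqrt(1 / a + 1 / b + 1 / c + 1 / d)
  z <- qnorm(1 - (1 - conf_level) / 2)  # about 1.96 for a 95% interval
  c(or = or,
    lower = exp(log(or) - z * se_log_or),
    upper = exp(log(or) + z * se_log_or))
}

# Counts from table (8): standard dose is "not exposed", double dose is "exposed"
ci <- wald_ci_or(a = 71, b = 19, c = 86, d = 15)
round(ci, 2)
```

Passing a different `conf_level`, for example `0.99`, widens the interval without any other change to the code.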
We can now state that the OR is 0.65 with a \(95\)% CI of 0.31 to 1.37. Since the interval includes values below and above \(1.0\), the uncertainty in our OR means that a double dose could either decrease or increase the odds of infection in the population. If we were to calculate a p value for this OR, it would therefore not be significant at \(\alpha = 0.05\).
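The p value mentioned here can be sketched with a \(z\) test on the log odds ratio, which rests on the same normal approximation as (10). This is one common choice and an assumption on our part; other tests (for example a chi-squared or Fisher's exact test) would give somewhat different values.

```r
# Counts from table (8), labelled as in table (1)
cells <- c(a = 71, b = 19, c = 86, d = 15)

log_or <- log((cells[["a"]] * cells[["d"]]) / (cells[["b"]] * cells[["c"]]))
se_log_or <- sqrt(sum(1 / cells))  # equation (9)

# z statistic for the null hypothesis OR = 1, that is ln(OR) = 0
z_stat <- log_or / se_log_or

# Two-sided p value from the standard normal distribution
p_value <- 2 * pnorm(-abs(z_stat))
p_value
```

The resulting p value is about 0.26, well above \(0.05\), which agrees with the interval including \(1.0\).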
Non-parametric resampling draws new samples from the original dataset. Since we sample with replacement (each selected observation is returned to the pool before the next draw), this is a bootstrap resampling technique.
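To make sampling with replacement concrete, the sketch below draws one bootstrap resample from a small made-up vector using base R's `sample`. The values and the seed are arbitrary and serve only as an illustration.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable

original <- c(10, 20, 30, 40, 50)  # made-up observations

# One bootstrap resample: same size as the original, drawn with replacement
resampled <- sample(original, size = length(original), replace = TRUE)
resampled

# Values can repeat, so some originals may be missing from the resample
table(resampled)
```

Repeating this draw many times, and recomputing the statistic of interest each time, yields the bootstrap distribution of that statistic.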
Below, we import the data file that gave rise to our table of observed values, shown in (8).
df <- read.csv("data_OR.csv")
head(df)
## Group Infection
## 1 Standard No
## 2 Double No
## 3 Standard No
## 4 Double No
## 5 Double No
## 6 Standard Yes
The read.csv function creates a data.frame object. Using the str function to investigate the structure of the object, we note that there are two variables, each identified as being of character type.
str(df)
## 'data.frame': 191 obs. of 2 variables:
## $ Group : chr "Standard" "Double" "Standard" "Double" ...
## $ Infection: chr "No" "No" "No" "No" ...
We need to change these to factor type (nominal categorical variables). This is done using the factor function. The levels argument lists the sample space elements in order, such that the last element is the one under investigation. In our example that would be the Double group and the Yes element for infection.
df$Group <- factor(df$Group, levels = c("Standard", "Double"))
df$Infection <- factor(df$Infection, levels = c("No", "Yes"))
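The explicit levels matter because, without them, factor sorts the levels alphabetically. A short sketch with a made-up vector of group labels shows the difference.

```r
groups <- c("Standard", "Double", "Standard", "Double")  # made-up labels

# Without explicit levels, factor() sorts the levels alphabetically
levels(factor(groups))

# With explicit levels, the group under investigation comes last
levels(factor(groups, levels = c("Standard", "Double")))
```

In the first case Double would come first, so the rows and columns of any cross-tabulation would no longer line up with the layout in (1).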
We can use the xtabs function to recreate the table of observations in (8).
xtabs(~Infection + Group, df)
## Group
## Infection Standard Double
## No 71 86
## Yes 19 15
We can express the four values as a vector.
freq_vector <- as.vector(xtabs(~Infection + Group, df))
freq_vector
## [1] 71 19 86 15
For (7) this would mean \(a=71\), which is freq_vector[1], \(b=19\), which is freq_vector[2], \(c=86\), which is freq_vector[3], and \(d=15\), which is freq_vector[4]. The odds ratio is therefore as calculated below.
or <- (freq_vector[1] * freq_vector[4]) / (freq_vector[2] * freq_vector[3])
or
## [1] 0.6517748
It is \(0.65\) as we calculated above.
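Positional indexing depends on the order in which the table is stored. As an alternative sketch, the cells can be picked out by name. The table below is rebuilt from the counts in (8) rather than read from the data file, and the names `tab` and `or_named` are our own.

```r
# Rebuild the 2 x 2 table from the counts in (8); filled column by column
tab <- as.table(matrix(
  c(71, 19, 86, 15),
  nrow = 2,
  dimnames = list(Infection = c("No", "Yes"),
                  Group = c("Standard", "Double"))
))

# Index cells by name so the formula does not depend on storage order
or_named <- (tab["No", "Standard"] * tab["Yes", "Double"]) /
  (tab["Yes", "Standard"] * tab["No", "Double"])
or_named
```

This gives the same odds ratio as the positional version, and it keeps working if the factor levels are later reordered.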
The resample_bootstrap function in the modelr package resamples from an original data.frame object. It returns a resample object that records the indices (row numbers) of the resampled rows. The sample size is the same as the original.
resample_bootstrap(df)
## <resample [191 x 2]> 27, 25, 165, 161, 9, 4, 167, 34, 62, 1, ...
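Now we create a function that returns the odds ratio computed from a bootstrap-resampled data.frame object.
or_simulation <- function(dataframe) {
  # Cross-tabulate one bootstrap resample; the cells come out in the order a, b, c, d of (1)
  frq <- as.vector(xtabs(~Infection + Group, resample_bootstrap(dataframe)))
  # Odds ratio as the cross-product ad / bc from (7)
  return((frq[1] * frq[4]) / (frq[2] * frq[3]))
}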
We use the to_vec function from the comprehenr library to create a vector of \(5000\) bootstrapped odds ratio values.
ors <- comprehenr::to_vec(for (i in 1:5000) or_simulation(df))
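The same simulation can be sketched in base R alone, which may help when modelr or comprehenr is not installed. Everything here is an assumption for illustration: the data are rebuilt from the counts in (8) instead of being read from the data file, the names `df_counts`, `or_bootstrap_once` and `ors_base` are hypothetical, and the seed is arbitrary, so the quantiles will not match the ones reported in this document exactly.

```r
set.seed(42)  # arbitrary seed, chosen only for repeatability

# Rebuild the data from the counts in (8); row order is arbitrary
df_counts <- data.frame(
  Group = factor(rep(c("Standard", "Standard", "Double", "Double"),
                     times = c(71, 19, 86, 15)),
                 levels = c("Standard", "Double")),
  Infection = factor(rep(c("No", "Yes", "No", "Yes"),
                         times = c(71, 19, 86, 15)),
                     levels = c("No", "Yes"))
)

# Hypothetical base-R counterpart of or_simulation
or_bootstrap_once <- function(dataframe) {
  rows <- sample(nrow(dataframe), replace = TRUE)  # resample row indices
  frq <- as.vector(xtabs(~Infection + Group, dataframe[rows, ]))
  (frq[1] * frq[4]) / (frq[2] * frq[3])
}

# 5000 bootstrapped odds ratios and their percentile bounds
ors_base <- replicate(5000, or_bootstrap_once(df_counts))
quantile(ors_base, probs = c(0.025, 0.975))
```

Here `replicate` plays the role of the comprehension, and `sample` with `replace = TRUE` plays the role of resample_bootstrap.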
A histogram shows the frequency of the odds ratio values.
hist(ors,
     main = "Histogram of simulated OR values",
     xlab = "OR",
     ylab = "Frequency",
     col = "orange",
     las = 1)
We now use the quantile function to find the \(2.5\)th and \(97.5\)th percentiles as the lower and upper bounds of the \(95\)% confidence interval.
quantile(ors, probs = c(0.025, 0.975))
## 2.5% 97.5%
## 0.287906 1.374798
These are close to the values found analytically.
The OR allows us to express an increase or decrease in odds based on being in one of two groups. It is the only ratio that we can use in case-control series. It is important to express uncertainty in sample statistics, including the OR. The confidence interval bounds around the OR are easily calculated and interpreted.