Analysis in the Health Sciences

By: Shreya Shakya
Book: Biostatistics: A Foundation for Analysis in the Health Sciences

Course: STAT 610- Applied Statistics for Health Professionals

  1. In a study by Cross et al. (A-2), patients who were involved in problem gambling treatment were asked about co-occurring drug and alcohol addictions.

Let the discrete random variable X represent the number of co-occurring addictive substances used by the subjects. Table 4.2.4 summarizes the frequency distribution for this random variable.

data_q1 <- read.csv('data_q1.csv')
#View(data_q1)
frequen <- data.frame(data_q1)
relative_freq <- transform(frequen, Relative_freq= prop.table(Frequency), Cum_freq_relative =cumsum(prop.table(Frequency)))
relative_freq
##    Substance_used Frequency Relative_freq Cum_freq_relative
## 1               0       144   0.185328185         0.1853282
## 2               1       342   0.440154440         0.6254826
## 3               2       142   0.182754183         0.8082368
## 4               3        72   0.092664093         0.9009009
## 5               4        39   0.050193050         0.9510940
## 6               5        20   0.025740026         0.9768340
## 7               6         6   0.007722008         0.9845560
## 8               7         9   0.011583012         0.9961390
## 9               8         2   0.002574003         0.9987130
## 10              9         1   0.001287001         1.0000000
  1. What is the probability that an individual selected at random used fewer than three addictive substances? Ans: Looking at the above table, probability that individual selected at random used fewer than three addictive substances is P(X <3) = 0.80824.



  1. The study cited in Exercise 5.3.1 reported an estimated mean serum cholesterol level of 183 for women aged 20–29 years. The estimated standard deviation was approximately 37. Use these estimates as the mean μ and standard deviation σ for the U.S. population. If a simple random sample of size 60 is drawn from this population, find the probability that the sample mean serum cholesterol level will be:
  1. Between 170 and 195
pnorm(195,mean=183,sd=37/sqrt(60))-pnorm(170,mean=183,sd=37/sqrt(60))
## [1] 0.9907523



  1. A study by Thienprasiddhi et al. (A-4) examined a sample of 16 subjects with open-angle glaucoma and unilateral hemifield defects. The ages (years) of the subjects were as follows.Can we conclude that the mean age of the population from which the sample may be presumed to have been drawn is less than 60 years? Let α = .05. 62 62 68 48 51 60 51 57 57 41 62 50 53 34 62 61

Ans: The data consist of ages for 16 subjects which constitute a simple random sample from a population of similar subjects. Looking at the qqplot, We assume that the ages in this population is approximately normally distributed.

H0: mu = 60 Hr: mu < 60

data <- c(62, 62,   68, 48, 51, 60, 51, 57, 57, 41, 62, 50, 53, 34, 62, 61)
qqnorm(data)
qqline(data,col=2)

(n <- length(data))
## [1] 16
(xbar <- mean(data))
## [1] 54.9375
(s <- sd(data))
## [1] 8.872946
(ts <- qt((1-.95),df=n-1,lower.tail=TRUE))
## [1] -1.75305
t <- (xbar-60)/(s/sqrt(n))
t
## [1] -2.282218
t.test(data, mu = 60, alternative = "less",var.equal=F,conf.level=0.95)
## 
##  One Sample t-test
## 
## data:  data
## t = -2.2822, df = 15, p-value = 0.01874
## alternative hypothesis: true mean is less than 60
## 95 percent confidence interval:
##      -Inf 58.82618
## sample estimates:
## mean of x 
##   54.9375

Since the population variance is unknown, our test statistic is given by t. From our sample data, we compute a sample mean of 54.94 and a sample standard deviation of 8.87. Substituting these statistics into test-statistic gives us -2.2822.

Reject H0, since −2.2822 falls in the rejection region. Our conclusion, based on these data, is that the mean of the population from which the sample came is less than 60.



  1. Patients suffering from rheumatic diseases or osteoporosis often suffer critical losses in bone mineral density (BMD). Alendronate is one medication prescribed to build or prevent further loss of BMD. Holcomb and Rothenberg (A-3) looked at 96 women taking alendronate to determine if a difference existed in the mean percent change in BMD among five different primary diagnosis classifications. Group 1 patients were diagnosed with rheumatoid arthritis (RA). Group 2 patients were a mixed collection of patients with diseases including lupus, Wegener’s granulomatosis and polyarteritis, and other vasculitic diseases (LUPUS). Group 3 patients had polymyalgia rheumatica or temporal arthritis (PMRTA). Group 4 patients had osteoarthritis (OA), and group 5 patients had osteoporosis (O) with no other rheumatic diseases identified in the medical record. Changes in BMD are shown in the following table.

Step 1: H0: Mu1 = Mu2 = Mu3 = Mu4 = Mu5
HA: Not all means are equal.

data_1 <- read.csv('EXR_C08_S02_02.csv')
data_1$GROUP = as.factor(data_1$GROUP)

tally(~GROUP,data=data_1)
## GROUP
##  1  2  3  4  5 
## 37  9 16 24 10
stripchart(data_1$BMD~data_1$GROUP,method = "stack",pch=19,xlab='BMD',ylab='GROUP')

boxplot(data_1$BMD~data_1$GROUP)

gf_qq(~BMD|GROUP, data=data_1) %>% gf_qqline()

Looking at the graphs, the data sample can be considered normal. Now testing for equality of variances, first we check the SDs of all the groups.

favstats(BMD~GROUP, data=data_1)
##   GROUP    min       Q1 median       Q3    max     mean       sd  n missing
## 1     1 -9.646 -0.37200 4.6590  7.52100 24.414 4.469892 7.499160 37       0
## 2     2 -1.369  2.83200 3.9970  7.26000 11.288 4.578000 3.991317  9       0
## 3     3 -7.816  0.34950 1.3505  4.74700 10.734 2.180813 4.542536 16       0
## 4     4 -2.817  1.21550 4.7020  7.33675 15.853 5.204542 4.621640 24       0
## 5     5  1.719  3.80475 6.2825 14.22400 25.655 9.671800 8.153822 10       0

Here, minimum SD is around half of maximum SD, hence we can consider equal variance. We use Levene’s test for Homogeneity of Variance.

with(data_1, leveneTest(BMD, GROUP, center=mean))
## Levene's Test for Homogeneity of Variance (center = mean)
##       Df F value  Pr(>F)  
## group  4  2.2802 0.06667 .
##       91                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This test has equal variances, so we would prefer not to reject the null hypothesis. The p-value of 0.0667 > 0.5, so we fail to reject H0, so we can go ahead and run the ANOVA to calculate the F-value and P-value.

AnovaModel.1 <- aov(BMD ~ GROUP, data=data_1)
summary(AnovaModel.1)
##             Df Sum Sq Mean Sq F value Pr(>F)  
## GROUP        4    355   88.86   2.277  0.067 .
## Residuals   91   3551   39.02                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since (p-value) 0.067 > 0.05, we failed to reject null hypothesis.We can conclude that the Null hypothesis is true. That is no significance difference existed in the mean percent change in BMD among five different primary diagnosis classifications.No post-hoc analysis (Tukey’s or Bonferroni’s) is required since we failed to reject H0.



  1. Family caregiving of older adults is more common in Korea than in the United States. Son et al. (A-3) studied 100 caregivers of older adults with dementia in Seoul, South Korea. The dependent variable was caregiver burden as measured by the Korean Burden Inventory (KBI). Scores ranged from 28 to 140, with higher scores indicating higher burden. Explanatory variables were indexes that measured the following:

ADL: total activities of daily living (low scores indicate that the elderly perform activities independently). MEM: memory and behavioral problems (higher scores indicate more problems). COG: cognitive impairment (lower scores indicate a greater degree of cognitive impairment).

data_mr <- read.csv('EXR_C10_S03_02.csv')
scatterplotMatrix(~KBI+ADL+MEM+COG, smooth=FALSE,  diagonal = 'density', data=data_mr)

Checking the assumptions for Multiple Linear regression

# Let's make a residual.analysis function to save typing.
residual.analysis <- function(model=NULL) {
  require(gridExtra)
  df <- data.frame(p=predict(model),
                   r=rstandard(model))
  p1 <- gf_point(r~p,data=df) %>% gf_hline(yintercept = 0)
  p2 <- gf_qq(~r,data=df) %>% gf_qqline()
  grid.arrange(p1, p2, ncol=2)
}
RegModel.mr <- lm(KBI~ADL+MEM+COG, data=data_mr)
residual.analysis(RegModel.mr)

Here, both the assumption of constant variance and linearity holds good.

H0: B1 = B2 = B3 =0

HA: Not all B = 0

summary(RegModel.mr)
## 
## Call:
## lm(formula = KBI ~ ADL + MEM + COG, data = data_mr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -42.037 -10.535  -1.503   9.213  43.151 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  40.4908    10.1030   4.008 0.000121 ***
## ADL           0.2162     0.1168   1.851 0.067273 .  
## MEM           0.5547     0.1300   4.267 4.65e-05 ***
## COG           0.1210     0.3003   0.403 0.687978    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.26 on 96 degrees of freedom
## Multiple R-squared:  0.282,  Adjusted R-squared:  0.2596 
## F-statistic: 12.57 on 3 and 96 DF,  p-value: 5.315e-07

The regression line is: KBI = 40.491 + 0.216 * ADL + 0.555 * MEM + 0.121 * COG

From the above table, coeff of determination is 0.282. We can say that about 28 percent of the total variation in the Y values is explained by the fitted regression plane, that is, by the linear relationship with ADL, MEM and COG level.

For F test, F is 12.57 and p-value is 5.315e-07. Since p-value < 0.05, we reject H0 and can conlcude that not all slopes are equal to zero.

For individual t-tests, since p-value for MEM = 4.65e-05. Since pvalue < 0.05, we reject H0 for MEM and we can conclude that slope of MEM in regression line is not equal to zero. Therefore, linear model provides a good fit to the data. However, p-values for ADL and COG > 0.05, we failed to reject H0 and we can conclude that slope of ADL and COG in regression line is equal to zero.

95% CI for only the MEM factor (which is only significant):

confint(RegModel.mr,level=0.95)
##                   2.5 %     97.5 %
## (Intercept) 20.43647272 60.5451146
## ADL         -0.01567293  0.4480273
## MEM          0.29662925  0.8126774
## COG         -0.47511019  0.7170349

The 95% CI for MEM is (0.29663,0.81268)

We are 95 percent confident that the interval constructed for MEM (0.29662925,0.8126774).