Course: STAT 610- Applied Statistics for Health Professionals
Let the discrete random variable X represent the number of co-occurring addictive substances used by the subjects. Table 4.2.4 summarizes the frequency distribution for this random variable.
data_q1 <- read.csv('data_q1.csv')
#View(data_q1)
frequen <- data.frame(data_q1)
relative_freq <- transform(frequen, Relative_freq= prop.table(Frequency), Cum_freq_relative =cumsum(prop.table(Frequency)))
relative_freq
## Substance_used Frequency Relative_freq Cum_freq_relative
## 1 0 144 0.185328185 0.1853282
## 2 1 342 0.440154440 0.6254826
## 3 2 142 0.182754183 0.8082368
## 4 3 72 0.092664093 0.9009009
## 5 4 39 0.050193050 0.9510940
## 6 5 20 0.025740026 0.9768340
## 7 6 6 0.007722008 0.9845560
## 8 7 9 0.011583012 0.9961390
## 9 8 2 0.002574003 0.9987130
## 10 9 1 0.001287001 1.0000000
pnorm(195,mean=183,sd=37/sqrt(60))-pnorm(170,mean=183,sd=37/sqrt(60))
## [1] 0.9907523
Ans: The data consist of ages for 16 subjects which constitute a simple random sample from a population of similar subjects. Looking at the qqplot, We assume that the ages in this population is approximately normally distributed.
H0: mu = 60 Hr: mu < 60
data <- c(62, 62, 68, 48, 51, 60, 51, 57, 57, 41, 62, 50, 53, 34, 62, 61)
qqnorm(data)
qqline(data,col=2)
(n <- length(data))
## [1] 16
(xbar <- mean(data))
## [1] 54.9375
(s <- sd(data))
## [1] 8.872946
(ts <- qt((1-.95),df=n-1,lower.tail=TRUE))
## [1] -1.75305
t <- (xbar-60)/(s/sqrt(n))
t
## [1] -2.282218
t.test(data, mu = 60, alternative = "less",var.equal=F,conf.level=0.95)
##
## One Sample t-test
##
## data: data
## t = -2.2822, df = 15, p-value = 0.01874
## alternative hypothesis: true mean is less than 60
## 95 percent confidence interval:
## -Inf 58.82618
## sample estimates:
## mean of x
## 54.9375
Since the population variance is unknown, our test statistic is given by t. From our sample data, we compute a sample mean of 54.94 and a sample standard deviation of 8.87. Substituting these statistics into test-statistic gives us -2.2822.
Reject H0, since −2.2822 falls in the rejection region. Our conclusion, based on these data, is that the mean of the population from which the sample came is less than 60.
Step 1: H0: Mu1 = Mu2 = Mu3 = Mu4 = Mu5
HA: Not all means are equal.
data_1 <- read.csv('EXR_C08_S02_02.csv')
data_1$GROUP = as.factor(data_1$GROUP)
tally(~GROUP,data=data_1)
## GROUP
## 1 2 3 4 5
## 37 9 16 24 10
stripchart(data_1$BMD~data_1$GROUP,method = "stack",pch=19,xlab='BMD',ylab='GROUP')
boxplot(data_1$BMD~data_1$GROUP)
gf_qq(~BMD|GROUP, data=data_1) %>% gf_qqline()
Looking at the graphs, the data sample can be considered normal. Now testing for equality of variances, first we check the SDs of all the groups.
favstats(BMD~GROUP, data=data_1)
## GROUP min Q1 median Q3 max mean sd n missing
## 1 1 -9.646 -0.37200 4.6590 7.52100 24.414 4.469892 7.499160 37 0
## 2 2 -1.369 2.83200 3.9970 7.26000 11.288 4.578000 3.991317 9 0
## 3 3 -7.816 0.34950 1.3505 4.74700 10.734 2.180813 4.542536 16 0
## 4 4 -2.817 1.21550 4.7020 7.33675 15.853 5.204542 4.621640 24 0
## 5 5 1.719 3.80475 6.2825 14.22400 25.655 9.671800 8.153822 10 0
Here, minimum SD is around half of maximum SD, hence we can consider equal variance. We use Levene’s test for Homogeneity of Variance.
with(data_1, leveneTest(BMD, GROUP, center=mean))
## Levene's Test for Homogeneity of Variance (center = mean)
## Df F value Pr(>F)
## group 4 2.2802 0.06667 .
## 91
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This test has equal variances, so we would prefer not to reject the null hypothesis. The p-value of 0.0667 > 0.5, so we fail to reject H0, so we can go ahead and run the ANOVA to calculate the F-value and P-value.
AnovaModel.1 <- aov(BMD ~ GROUP, data=data_1)
summary(AnovaModel.1)
## Df Sum Sq Mean Sq F value Pr(>F)
## GROUP 4 355 88.86 2.277 0.067 .
## Residuals 91 3551 39.02
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since (p-value) 0.067 > 0.05, we failed to reject null hypothesis.We can conclude that the Null hypothesis is true. That is no significance difference existed in the mean percent change in BMD among five different primary diagnosis classifications.No post-hoc analysis (Tukey’s or Bonferroni’s) is required since we failed to reject H0.
ADL: total activities of daily living (low scores indicate that the elderly perform activities independently). MEM: memory and behavioral problems (higher scores indicate more problems). COG: cognitive impairment (lower scores indicate a greater degree of cognitive impairment).
data_mr <- read.csv('EXR_C10_S03_02.csv')
scatterplotMatrix(~KBI+ADL+MEM+COG, smooth=FALSE, diagonal = 'density', data=data_mr)
Checking the assumptions for Multiple Linear regression
# Let's make a residual.analysis function to save typing.
residual.analysis <- function(model=NULL) {
require(gridExtra)
df <- data.frame(p=predict(model),
r=rstandard(model))
p1 <- gf_point(r~p,data=df) %>% gf_hline(yintercept = 0)
p2 <- gf_qq(~r,data=df) %>% gf_qqline()
grid.arrange(p1, p2, ncol=2)
}
RegModel.mr <- lm(KBI~ADL+MEM+COG, data=data_mr)
residual.analysis(RegModel.mr)
Here, both the assumption of constant variance and linearity holds good.
H0: B1 = B2 = B3 =0
HA: Not all B = 0
summary(RegModel.mr)
##
## Call:
## lm(formula = KBI ~ ADL + MEM + COG, data = data_mr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.037 -10.535 -1.503 9.213 43.151
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40.4908 10.1030 4.008 0.000121 ***
## ADL 0.2162 0.1168 1.851 0.067273 .
## MEM 0.5547 0.1300 4.267 4.65e-05 ***
## COG 0.1210 0.3003 0.403 0.687978
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.26 on 96 degrees of freedom
## Multiple R-squared: 0.282, Adjusted R-squared: 0.2596
## F-statistic: 12.57 on 3 and 96 DF, p-value: 5.315e-07
The regression line is: KBI = 40.491 + 0.216 * ADL + 0.555 * MEM + 0.121 * COG
From the above table, coeff of determination is 0.282. We can say that about 28 percent of the total variation in the Y values is explained by the fitted regression plane, that is, by the linear relationship with ADL, MEM and COG level.
For F test, F is 12.57 and p-value is 5.315e-07. Since p-value < 0.05, we reject H0 and can conlcude that not all slopes are equal to zero.
For individual t-tests, since p-value for MEM = 4.65e-05. Since pvalue < 0.05, we reject H0 for MEM and we can conclude that slope of MEM in regression line is not equal to zero. Therefore, linear model provides a good fit to the data. However, p-values for ADL and COG > 0.05, we failed to reject H0 and we can conclude that slope of ADL and COG in regression line is equal to zero.
95% CI for only the MEM factor (which is only significant):
confint(RegModel.mr,level=0.95)
## 2.5 % 97.5 %
## (Intercept) 20.43647272 60.5451146
## ADL -0.01567293 0.4480273
## MEM 0.29662925 0.8126774
## COG -0.47511019 0.7170349
The 95% CI for MEM is (0.29663,0.81268)
We are 95 percent confident that the interval constructed for MEM (0.29662925,0.8126774).