To investigate the association between place of habitation and health, researchers have recorded BMI from a number subjects residing in two city blocks, marked as Blocks A and B. The data are in the first tab of the data file. Perform an appropriate procedure to test whether place of habitation is associated with BMI.
To start, we should read the data into R Studio. By looking at the first few lines we are able to get a sense of how the dataset is structured. Since we are interested in differences in BMI between habitation groups it’s a good idea to subset our dataset into two different variables, block A and block B.
Q1 <- as.data.frame(read_excel("HW3.xlsx", sheet = 1))
Q1.A = subset(Q1, Block == "A", select = BMI)
Q1.B = subset(Q1, Block == "B", select = BMI)
head(Q1)
## ID BMI Block
## 1 1 26.18310 A
## 2 2 25.43993 A
## 3 3 25.91070 A
## 4 4 27.17430 A
## 5 5 26.23963 A
## 6 6 24.92497 A
It is recommended to visualize the data before doing any analysis. The boxplot indicates a huge difference in the mean of BMIs between the two blocks. The density plot also shows the distribution of the two data that seems to be a normal distribution.
boxplot(BMI ~ Block, data = Q1, main = "The differences in BMI between block A and block B", ylab = "BMI", xlab = "Block")
densityplot(~ BMI | Block, data = Q1, layout = c(1,2), col = c("red"))
Although it can be estimated that the distribution of the data is normal from the density plot, it is recommended to confirm this fact by plotting a Q-Q plot to visually inspect the normality of the data. Shapiro-Wilk Normality test was also performed to further confirm the normality.
ggqqplot(Q1.A$BMI, main = "Q-Q Plot for Block A Data",ggtheme = theme_bw())
shapiro.test(Q1.A$BMI)
##
## Shapiro-Wilk normality test
##
## data: Q1.A$BMI
## W = 0.98424, p-value = 0.9235
ggqqplot(Q1.B$BMI, main = "Q-Q Plot for Block B Data", ggtheme = theme_bw())
shapiro.test(Q1.B$BMI)
##
## Shapiro-Wilk normality test
##
## data: Q1.B$BMI
## W = 0.98688, p-value = 0.9648
The Q-Q plots demonstrate most of the data is plotted close to the Q-Q line indicating the normality of the data. The high p-value of Shapiro-Wilk test for both data suggest that we fail to reject the null hypothesis of data being normal. These plots and tests are strong evidences of normality of our data.
One last statistic to be computed is the variances of the data. Although it can be inferred from the boxplots that the variances are almost similar, further tests must be performed to confirm the significance of this inference. # Variances of both data were computed which confirmed the similarity of the variances. High p-value of F-Test indicated that we fail to reject the null hypothesis that the data have similar variances, therefore we can confidently state the variances of data are similar and the evidence is strong.
var(Q1.A$BMI)
## [1] 0.6956538
var(Q1.B$BMI)
## [1] 0.6498151
(Q1.v <- var.test(BMI ~ Block, data = Q1, mu = 0, alternative = "two.sided"))
##
## F test to compare two variances
##
## data: BMI by Block
## F = 1.0705, num df = 29, denom df = 29, p-value = 0.8556
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.5095399 2.2492027
## sample estimates:
## ratio of variances
## 1.070541
Confirming the normality of the data will help us to choose a parametric test and since the data are not paired, unpaired t-test is the best option. Setting the var.equal as TRUE we indicate that the test considers the equal variances of the data.
(Q1.T <- t.test(BMI ~ Block, data = Q1, mu = 0, alternative = "two.sided", var.equal = T, conf.level = 0.95, paired = F))
##
## Two Sample t-test
##
## data: BMI by Block
## t = -13.921, df = 58, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.371987 -2.524156
## sample estimates:
## mean in group A mean in group B
## 25.59799 28.54606
By looking at the mean in each group (block A vs. block B) we can see a huge difference. The extremely low p-value which is less than \(\alpha = 0.05\) shows the significance of the test. With that being said, we can safely reject the null hypothesis that there is no difference between the mean of each group. In other words, we can conclude that people who live in the block B location have a significantly higher BMI. The difference between the two locations’ BMIs is almost 3.
The power of the two sample test was computed to ensure the result of the t-test done before. The power of 0.99 indicates the high power of the test which means the probability of the type 2 errors were extremely low.
n_A <- length(Q1.A$BMI)
n_B <- length(Q1.B$BMI)
n1 <- min(n_A, n_B)
d <- abs((mean(Q1.A$BMI) - mean(Q1.B$BMI))/sd(c(Q1.A$BMI, Q1.B$BMI))) %>% round(., 2)
(Q1.p.t <- power.t.test(delta = d, n = n1, power = NULL, sig.level = 0.05, type = "two.sample", alternative = "two.sided"))
##
## Two-sample t test power calculation
##
## n = 30
## delta = 1.74
## sd = 1
## sig.level = 0.05
## power = 0.9999985
## alternative = two.sided
##
## NOTE: n is number in *each* group
plot(Q1.p.t)
A number of panic disorder patients went through an experimental treatment. First, their average times (in weeks) between two successive panic attacks were recorded. Then, they were asked to start a treatment regime (named DrugX). Once they are on the treatment, their average times (in weeks) between two successive panic attacks were recorded again. The data are in the second tab of the data file. Perform an appropriate procedure to test whether the treatment made a difference or not. # Just like the previous question, the data was imported into R Studio and was subsetted into two different variables. One variable containing the data for before the treatment and the other one for after the treatment.
Q2 <- as.data.frame(read_excel("HW3.xlsx", sheet = 2))
Q2.pre <- Q2$`No Treatment`
Q2.post <- Q2$DrugX
Before running any test, the data was visualized to have a better sense of our dataset. By just looking at the data, we suspect that there was no change in the panic attack episode before and after drug. However, this should be tested with appropriate analysis.
boxplot(Q2.pre, Q2.post, main = "Average time of two successive panic attacks per week", names = c("Pre-treatment", "Post-treatment"))
It’s necessary to check the normality of the data to decide whether we need a parametric test or non-parametric test. The Q-Q plots indicate that the data is not normal, however, Shapiro-Wilk test should be performed to confirm this. Shapiro-Wilk test for both of the data indicates that they both are not normal. \(p.value\) of both data are lower than 0.05 which indicates that the null hypothesis of data being normal can be rejected.
ggqqplot(Q2.pre, main = "Q-Q Plot for Pre-treatment Data", ggtheme = theme_bw())
shapiro.test(Q2.post)
##
## Shapiro-Wilk normality test
##
## data: Q2.post
## W = 0.84155, p-value = 0.01321
ggqqplot(Q2.post, main = "Q-Q Plot for Post-treatment Data", ggtheme = theme_bw())
shapiro.test(Q2.pre)
##
## Shapiro-Wilk normality test
##
## data: Q2.pre
## W = 0.83168, p-value = 0.009658
Also both of the variances for the data were computed and tested for significance using F-test. Variances were similar and high \(p-value\) of 0.94 indicates that the null hypothesis that variances are similar cannot be rejected.
var(Q2.pre)
## [1] 11.51775
var(Q2.post)
## [1] 11.9857
(Q2.v <- var.test(Q2.pre, Q2.post, mu = 0, alternative = "two.sided"))
##
## F test to compare two variances
##
## data: Q2.pre and Q2.post
## F = 0.96096, num df = 14, denom df = 14, p-value = 0.9417
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.3226221 2.8622984
## sample estimates:
## ratio of variances
## 0.9609583
Based on how the question was worded, it seems like the drugX was a new drug that was being tested on patients. As in many clinical trials, we never know what will be the results of testing a new drug. Although in this case we are hoping to reduce the frequency of the panic attacks and one might suggests a one sided alternative for the test. Despite that, I still chose a two sided test which would show any changes, whether the drug increased or reduced the panic attacks. # paired = T was chosen because the two groups of data were from the same patients and can be paired. exact = F was chosen due to small sample size and possibility of ties within the sample data.
(Q2.v <- wilcox.test(Q2$`No Treatment`, Q2$DrugX, data = Q2, mu = 0, alternative = "two.sided", paired = T, conf.int = T, exact = F))
##
## Wilcoxon signed rank test with continuity correction
##
## data: Q2$`No Treatment` and Q2$DrugX
## V = 58, p-value = 0.9321
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
## -0.1453352 0.1120264
## sample estimates:
## (pseudo)median
## -0.0143766
The \(p-value\) of 0.9321 indicates that the test does not show significant evidence to reject the null hypothesis of the medians being similar. Therefore we can conclude that the drugX did not affect the patients in any way.
To understand how smoking behavior is associate with education attainment, questionnaires were sent out to a number of people and their responses were collected. The data are in the third tab of the data file. Perform an appropriate procedure. # The data was read into R Studio, however, the data was saved into a new matrix to be easily read by the specific functions in R.
Q3 <- as.data.frame(read_excel("HW3.xlsx", sheet = 3))
Q3 <- cbind(Q3$`Ever Smoker`, Q3$`Never Smoker`)
colnames(Q3) <- c("Ever Smoker","Never Smoker")
rownames(Q3) <- c("High School or Less", "Some College", "Bachelor's Degree", "Graduate or Professional Degree")
Just like other questions so far, the data was visualized before performing any tests. However, it’s hard to draw any speculations from the visualization of the data alone. There is an increasing trend among “Never Smoker” sample group based on their educational attainment, however, this trend was not similar in “Ever Smoker” group.
barplot(Q3, beside = TRUE, legend = rownames(Q3), args.legend = list(x = "topleft"))
There are two assumptions when performing Chi-Square test: 1. Less than 20% of the contingency cells have an expected value of less than 5. In other words your sample should be large enough. 2. Dataset should be independent.
The dataset (below) violates the first assumption of dataset being large. Therefore, Chi-square is not appropriate for our data and Fisher-exact test was performed.
| Ever Smoker | Never Smoker | |
|---|---|---|
| High School or Less | 3 | 3 |
| Some College | 9 | 10 |
| Bachelor’s Degree | 5 | 13 |
| Graduate or Professional Degree | 8 | 24 |
fisher.test(Q3, alternative = "two.sided", conf.level = 0.95)
##
## Fisher's Exact Test for Count Data
##
## data: Q3
## p-value = 0.3094
## alternative hypothesis: two.sided
The fisher exact test computed a \(p-value\) of 0.3094 for the two sided alternative hypothesis. The \(p-value\) is too high (above 0.05) to reject the null hypothesis of assuming independence of smoking behavior and education attainment. Therefore, we conclude with the given dataset we can say we don’t have enough evidence of relationship between smoking behavior and education attainment.