- **Run cor.test() on the correlation between “area” and “perm” in the rock data set and interpret the results. Note that you will have to use the “$” accessor to get at each of the two variables (like this: rock$area). Make sure that you interpret both the confidence interval and the *p*-value that is generated by cor.test().**
# Load the 'datasets' package to access the rock dataset
library(datasets)
data(rock)
# Perform the correlation test between 'area' and 'perm'
cor_test_result <- cor.test(rock$area, rock$perm)
print(cor_test_result)
##
## Pearson's product-moment correlation
##
## data: rock$area and rock$perm
## t = -2.9305, df = 46, p-value = 0.005254
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.6118206 -0.1267915
## sample estimates:
## cor
## -0.396637
# p-value: This is the probability of observing a correlation coefficient as extreme as or more extreme than -0.3966 (the sample estimate) if the true correlation in the population were zero. Here the p-value (0.005254) is well below the conventional significance level of 0.05, so the observed correlation is statistically significant. The correlation coefficient (r = -0.3966) indicates a moderate negative correlation: there is a discernible tendency for 'perm' to decrease as 'area' increases, and this trend is unlikely to be due to random chance alone. Statistical significance aside, a correlation of this size means that 'area' explains only a modest proportion of the variability in 'perm'.
# Confidence Interval: The 95% confidence interval for the correlation coefficient ranges from -0.61 to -0.13, which excludes zero and is therefore consistent with the significant p-value (0.005254): a true correlation of zero is implausible given these data, although the upper end of the interval (-0.13) leaves open the possibility that the true correlation is fairly weak. Squaring the correlation coefficient gives the coefficient of determination, the proportion of variance in one variable that can be explained by the other: R-squared = (-0.3966)^2 = 0.1573, or about 15.73%. This means that approximately 15.73% of the variability in 'perm' can be explained by changes in 'area', while the remaining 84.27% is due to other factors not accounted for in this analysis.
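# The R-squared figure above can be reproduced directly from the stored test object; a small illustrative snippet (cor_test_result$estimate is the named correlation returned by cor.test()):
r <- unname(cor_test_result$estimate)  # the sample correlation, -0.3966
r^2  # the coefficient of determination, approximately 0.1573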
- Create a copy of the bfCorTest() custom function presented in this chapter. Don’t forget to “source” it (meaning that you have to run the code that defines the function one time to make R aware of it). Conduct a Bayesian analysis of the correlation between “area” and “perm” in the rock data set.
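# The sourced file itself is not reproduced below. Based on the chapter's approach (standardize both variables, then test the linear coefficient, which for standardized data equals the correlation, with generalTestBF() from the BayesFactor package), bfCorTest.R plausibly contains something like the following sketch; the exact file contents are assumed:
library(BayesFactor)  # assumed to be attached by the sourced file, matching the load messages below
bfCorTest <- function(x, y) {
  zx <- scale(x)  # standardize x
  zy <- scale(y)  # standardize y
  zData <- data.frame(x = zx, rhoNot0 = zy)  # rhoNot0 stands in for the correlation rho
  bfOut <- generalTestBF(x ~ rhoNot0, data = zData)  # Bayes factor for the coefficient
  mcmcOut <- posterior(bfOut, iterations = 10000)  # draw posterior samples
  print(summary(mcmcOut[, "rhoNot0"]))  # summarize the posterior of rho
  return(bfOut)
}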
# Source the `bfCorTest()` function to make it available in your environment
source("~/Documents/IST-686 Quantitative Reasoning For Data Science/HW7/bfCorTest.R")
# Conduct the Bayesian correlation analysis
bayes_cor_result <- bfCorTest(rock$area, rock$perm)
## Loading required package: coda
## Loading required package: Matrix
## ************
## Welcome to BayesFactor 0.9.12-4.7. If you have questions, please contact Richard Morey (richarddmorey@gmail.com).
##
## Type BFManual() to open the manual.
## ************
##
## Iterations = 1:10000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 10000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## -0.342901 0.135693 0.001357 0.001544
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## -0.61696 -0.43302 -0.34012 -0.25066 -0.07837
print(bayes_cor_result)
## Bayes factor analysis
## --------------
## [1] rhoNot0 : 8.072781 ±0%
##
## Against denominator:
## Intercept only
## ---
## Bayes factor type: BFlinearModel, JZS
# Empirical Mean and SD: The mean of the posterior samples for rhoNot0 is -0.3429, with a standard deviation of 0.1357. The posterior mean is the central point estimate of the correlation coefficient, and the standard deviation describes how widely the plausible values spread around it.
# Quantiles: The 95% credible interval (the range within which we are 95% certain that the true correlation lies, given the data and the prior) runs from -0.6170 to -0.0784. This interval does not include zero, indicating evidence for a negative correlation between the two variables.
# Bayes factor = 8.072781: The data are about 8 times more likely under the model with a correlation (the linear model x ~ rhoNot0) than under the intercept-only null model of no correlation. This is conventionally considered moderate evidence in favor of a correlation.
# rhoNot0: This is the parameter representing the standardized regression coefficient (and, in this case, the correlation coefficient).
# ±0%: This means the error in calculating the Bayes Factor is negligible.
# Evidence for Correlation: The Bayes factor of 8.07 provides moderate evidence for a correlation between the variables.
# Negative Correlation: The 95% credible interval (-0.6170 to -0.0784) and the posterior mean (-0.3429) both indicate that the correlation is negative: as one variable increases, the other tends to decrease.
# Magnitude of Correlation: The correlation is estimated to be moderate, around -0.3429, with meaningful uncertainty, as reflected in the width of the credible interval.
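# To see where the credible-interval endpoints come from, posterior samples can be redrawn from the returned Bayes factor object and summarized with quantile(). A sketch, assuming bfCorTest() returns the underlying BayesFactor object as in the sketch above; MCMC results vary slightly from run to run:
set.seed(123)  # illustrative seed for a reproducible MCMC draw
mcmcOut <- posterior(bayes_cor_result, iterations = 10000)  # redraw posterior samples
quantile(as.numeric(mcmcOut[, "rhoNot0"]), probs = c(0.025, 0.975))  # 95% credible interval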
- Not unexpectedly, there is a data set in R that contains these data. The data set is called UCBAdmissions and you can access the department mentioned above like this: UCBAdmissions[ , ,1]. Make sure you put two commas before the 1: this is a three dimensional contingency table that we are subsetting down to two dimensions. Run chisq.test() on this subset of the data set and make sense of the results.
# Load the UCBAdmissions data
data(UCBAdmissions)
# Extract data for Department A (the first slice of the 3D table)
# The 1 means "select only the first slice" along the Dept dimension.
# In this dataset, the first slice corresponds to department A.
dept_A_data <- UCBAdmissions[,,1]
# Display the contingency table for Department A
print(dept_A_data)
## Gender
## Admit Male Female
## Admitted 512 89
## Rejected 313 19
# Perform the chi-square test of independence
chisq_result <- chisq.test(dept_A_data)
# Print the chi-square test results
print(chisq_result)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: dept_A_data
## X-squared = 16.372, df = 1, p-value = 5.205e-05
# Yates' continuity correction adjusts the chi-square statistic to compensate for approximating discrete counts with a continuous distribution; it matters most with small expected cell counts and makes the test more conservative, reducing the chance of a Type I error (false positive).
# X-squared = 16.372
# This is the chi-squared test statistic calculated with Yates' continuity correction, measuring the discrepancy between the observed frequencies in dept_A_data and the expected frequencies under the null hypothesis of no association between gender and admission status.
# df = 1
# This represents the degrees of freedom for the test. In a 2x2 contingency table like dept_A_data, there is 1 degree of freedom.
# p-value = 5.205e-05
# This is the probability of obtaining a chi-squared statistic as extreme as or more extreme than 16.372 if the null hypothesis (no association between gender and admission) were true. The p-value is very small (p-value = .00005205), which means it is highly unlikely to observe such a discrepancy between the observed and expected frequencies by chance alone.
# Interpretation:
# Given that the p-value is much smaller than the conventional significance level of 0.05, we reject the null hypothesis: there is strong evidence of a statistically significant association between gender and admission status in Department A. Note that R's chisq.test() applies Yates' continuity correction to every 2x2 table by default (correct = TRUE), regardless of the cell counts.
# While this test indicates an association, it does not prove that gender causes differences in admission rates; other factors could be influencing the outcome. The chi-squared test only reveals whether there is a statistical association between two categorical variables; it cannot isolate the effect of gender from other potential factors. To draw conclusions about causation, a different and more rigorous study design would be needed.
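# As a quick supplementary check (not part of the original prompt), the expected counts stored in the test object and the admission rates within each gender can be inspected with standard base-R accessors:
chisq_result$expected  # expected cell counts under independence
prop.table(dept_A_data, margin = 2)  # column proportions: admission rates by gender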
- Use contingencyTableBF() to conduct a Bayes factor analysis on the UCB admissions data. Report and interpret the Bayes factor.
# Initialize an empty list to store Bayes factors for each department
bf_results <- list()
# Loop through each department (slice)
for (i in 1:6) {
  # Extract data for the current department
  dept_data <- UCBAdmissions[,,i]
  # Perform the Bayes factor analysis with fixedMargin = "cols".
  # fixedMargin = "cols" means the column sums are fixed: the total number of male
  # and female applicants is treated as predetermined. By contrast, fixedMargin = "rows"
  # would fix the row totals (the number of admitted and rejected applicants).
  # priorConcentration = 1 is a relatively uninformative ("weak") prior: no strong
  # assumptions about the association between gender and admission status are made
  # beforehand, letting the data speak more strongly in determining the Bayes factor.
  bf_result <- contingencyTableBF(dept_data, sampleType = "indepMulti", fixedMargin = "cols", priorConcentration = 1)
  # Store the result in the list
  bf_results[[i]] <- bf_result
}
# Print Bayes factors for all departments
print(bf_results)
## [[1]]
## Bayes factor analysis
## --------------
## [1] Non-indep. (a=1) : 1352.131 ±0%
##
## Against denominator:
## Null, independence, a = 1
## ---
## Bayes factor type: BFcontingencyTable, independent multinomial
##
##
## [[2]]
## Bayes factor analysis
## --------------
## [1] Non-indep. (a=1) : 0.2650097 ±0%
##
## Against denominator:
## Null, independence, a = 1
## ---
## Bayes factor type: BFcontingencyTable, independent multinomial
##
##
## [[3]]
## Bayes factor analysis
## --------------
## [1] Non-indep. (a=1) : 0.1203794 ±0%
##
## Against denominator:
## Null, independence, a = 1
## ---
## Bayes factor type: BFcontingencyTable, independent multinomial
##
##
## [[4]]
## Bayes factor analysis
## --------------
## [1] Non-indep. (a=1) : 0.0978091 ±0%
##
## Against denominator:
## Null, independence, a = 1
## ---
## Bayes factor type: BFcontingencyTable, independent multinomial
##
##
## [[5]]
## Bayes factor analysis
## --------------
## [1] Non-indep. (a=1) : 0.1590309 ±0%
##
## Against denominator:
## Null, independence, a = 1
## ---
## Bayes factor type: BFcontingencyTable, independent multinomial
##
##
## [[6]]
## Bayes factor analysis
## --------------
## [1] Non-indep. (a=1) : 0.0559148 ±0%
##
## Against denominator:
## Null, independence, a = 1
## ---
## Bayes factor type: BFcontingencyTable, independent multinomial
# Department 1 (BF = 1352.131): There is very strong evidence for an association between gender and admission status in this department. The data are over 1000 times more likely under the hypothesis of an association than under the hypothesis of no association.
# Departments 2-5 (BFs ranging from 0.0978 to 0.2650): There is moderate evidence against an association in these departments; the data are roughly 4 to 10 times more likely under the null hypothesis of independence than under the alternative hypothesis of an association.
# Department 6 (BF = 0.0559): There is strong evidence against an association in this department. The data are about 18 times more likely under the null hypothesis than under the alternative hypothesis.
# The evidence points to a strong association between gender and admission status in Department 1 (which, per the causation caveat above, is not by itself proof of bias), while there does not appear to be a substantial association between gender and admission status in the remaining departments (2-6).
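# For a compact report, the six Bayes factors can be pulled out of the result list with extractBF() from the BayesFactor package; a small convenience snippet (department labels come from the dimnames of UCBAdmissions):
bf_values <- sapply(bf_results, function(b) extractBF(b)$bf)  # numeric Bayes factors
names(bf_values) <- dimnames(UCBAdmissions)$Dept  # label by department
round(bf_values, 4)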