install.packages("pagedown")

output:
  pdf_document:
    pandoc_args: --pdf-engine=pagedown::chrome_print
Because the OLS residuals have mean zero (\({\overline{\widehat{u}} = 0}\)) and are orthogonal to the fitted values (\({\sum\nolimits_{i}\widehat{u}_{i}\widehat{y}_{i} = 0}\)):

\({cov(\widehat{y},\widehat{u}) \equiv \frac{\sum\nolimits_{i}(\widehat{y}_{i}-\overline{\widehat{y}})(\widehat{u}_{i}-\overline{\widehat{u}})}{n-1}}\)

\({= \frac{\sum\nolimits_{i}(\widehat{u}_{i}-0)(\widehat{y}_{i}-\overline{\widehat{y}})}{n-1}}\)

\({= \frac{\sum\nolimits_{i}\widehat{u}_{i}\widehat{y}_{i}}{n-1}-\frac{\overline{\widehat{y}}\sum\nolimits_{i}\widehat{u}_{i}}{n-1}}\)

\({= \frac{0}{n-1}-\frac{\overline{\widehat{y}}(0)}{n-1}=0.}\)
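As a quick numerical check of this result (a minimal sketch using R's built-in mtcars data, purely for illustration), fitting any OLS regression and taking the sample covariance of the fitted values and residuals returns zero up to floating-point error:

fit <- lm(mpg ~ wt, data = mtcars)  # any simple OLS fit will do
y_hat <- fitted(fit)                # fitted values
u_hat <- resid(fit)                 # residuals (they sum to zero)
cov(y_hat, u_hat)                   # essentially 0 (on the order of 1e-16)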
2. The statement is FALSE. Chi-square tests of independence are actually adjusted for sample size N through the degrees of freedom (df) in the chi-square distribution. In a chi-square test of independence, the null hypothesis states that there is no association between two categorical variables. The alternative hypothesis suggests that there is an association between the variables. The chi-square test statistic is calculated by comparing the observed frequencies in each category with the expected frequencies under the assumption of independence. As the sample size N grows larger, the expected frequencies in each category tend to be larger as well. This results in a larger chi-square test statistic. However, the critical values of the chi-square distribution also increase as the degrees of freedom increase. Therefore, the adjustment for sample size through the degrees of freedom helps maintain the appropriate significance level for rejecting the null hypothesis.
The statement is FALSE. t-tests of the differences in group means are adjusted for sample size N through both the test statistic and the degrees of freedom. As the sample size N grows, the standard error of the difference (SE) decreases because the variances of the sample means shrink, so a given observed difference produces a larger t statistic. At the same time, the degrees of freedom increase, and the critical values of the t-distribution shrink toward those of the Normal distribution. Therefore, t-tests incorporate sample size N through both the calculation of the test statistic and the degrees of freedom.
The statement is TRUE. The t-distribution approximates the Normal distribution as the sample size (N) becomes large.
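A quick numerical illustration of this convergence (a sketch, not part of the original answer): the 97.5th-percentile critical value of the t-distribution shrinks toward the Normal value of about 1.96 as the degrees of freedom grow.

qt(0.975, df = c(5, 30, 100, 1000))  # 2.57, 2.04, 1.98, 1.96
qnorm(0.975)                         # 1.96, the Normal benchmark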
The statement is FALSE. A t-test of the difference between the means of two groups will not have the same likelihood of rejecting the null hypothesis in the two scenarios described. The degrees of freedom in a t-test are determined by the sample sizes of the two groups; for an independent two-sample t-test, \({df = (N_1 + N_2) - 2}\). In the first scenario, where \({N_M = 100, N_W = 100}\), the degrees of freedom are \({(100 + 100) - 2 = 198}\); in the second scenario, where \({N_M = 10, N_W = 190}\), they are also \({(10 + 190) - 2 = 198}\). However, the likelihood of rejecting the null hypothesis is not determined by the degrees of freedom alone; it also depends on the standard error of the difference, the magnitude of the observed difference, and the chosen significance level. With equal variances, the standard error of the difference is proportional to \({\sqrt{1/N_1 + 1/N_2}}\), which is considerably smaller under the balanced allocation, so the balanced design in the first scenario yields a higher likelihood of rejecting the null hypothesis than the highly imbalanced design in the second (see the sketch below).
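The sketch assumes, purely for illustration, a common within-group standard deviation of 10:

s <- 10                                   # assumed common within-group SD (illustrative)
se_balanced   <- s * sqrt(1/100 + 1/100)  # N_M = 100, N_W = 100
se_imbalanced <- s * sqrt(1/10 + 1/190)   # N_M = 10,  N_W = 190
c(balanced = se_balanced, imbalanced = se_imbalanced)
# about 1.41 vs 3.24: the imbalanced design has a much larger standard error,
# so the same true difference yields a smaller t statistic and lower power.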
\[\overline{X}=48.5, s_X=10, N_X=100\;\text{ and }\;\overline{Y}=51.5, s_Y=10, N_Y=100.\]
You are using \(\overline{X}\) and \(\overline{Y}\) as estimates of the population parameters \(\mu_X\) and \(\mu_Y\) with the confidence intervals shown in the figure below.
For group X: \({CI_X = 48.5 \pm t \times \frac{s_X}{\sqrt{N_X}}}\). Since \({N_X = 100}\), the standard error is \({SE_X = \frac{s_X}{\sqrt{N_X}} = \frac{10}{\sqrt{100}} = 1}\). For a 95% confidence level, the critical t value is approximately 1.97 with 198 degrees of freedom (very close to the Normal value of 1.96). Thus, the confidence interval for \(\mu_X\) becomes: \({CI_X = 48.5 \pm 1.97 \times 1}\).
For group Y, following the same logic, the standard error \(SE_Y\) is also 1. Thus, the confidence interval for \(\mu_Y\) becomes: \({CI_Y = 51.5 \pm 1.97 \times 1}\). Therefore, the 95% confidence interval around \(\mu_X\) is (46.53, 50.47), and the 95% confidence interval around \(\mu_Y\) is (49.53, 53.47).
The confidence interval around the difference in means is not obtained by adding or subtracting the two individual intervals; it is built from the standard error of the difference: \({SE_{diff} = \sqrt{SE_X^2 + SE_Y^2} = \sqrt{1^2 + 1^2} \approx 1.41}\).
The interval is therefore \({(\overline{Y} - \overline{X}) \pm 1.97 \times SE_{diff} = (51.5 - 48.5) \pm 2.79 = 3 \pm 2.79}\), or approximately (0.21, 5.79). Since the confidence interval for the difference does not contain zero, it suggests that there is a statistically significant difference between the means of groups X and Y. This finding contradicts the claim that \({\mu_X = \mu_Y}\). Therefore, based on the confidence interval for the difference, we have evidence to reject the claim that the two population means are equal.
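A short sketch of these calculations from the reported summary statistics (using the t critical value for 198 degrees of freedom):

xbar <- 48.5; ybar <- 51.5; s <- 10; n <- 100
se <- s / sqrt(n)                            # standard error of each mean = 1
tcrit <- qt(0.975, df = 2 * n - 2)           # about 1.97
xbar + c(-1, 1) * tcrit * se                 # CI for mu_X: about (46.53, 50.47)
ybar + c(-1, 1) * tcrit * se                 # CI for mu_Y: about (49.53, 53.47)
se_diff <- sqrt(se^2 + se^2)                 # SE of the difference, about 1.41
(ybar - xbar) + c(-1, 1) * tcrit * se_diff   # CI for the difference: about (0.21, 5.79)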
The reporter is incorrect to conclude that there is no significant difference between the means of groups X and Y based solely on the overlap of the confidence intervals around the individual means. Overlapping confidence intervals for two means do not guarantee the absence of a significant difference between them. In this case, the means of X and Y are 48.5 and 51.5, respectively, and the confidence intervals around these means overlap. The key quantity, however, is the confidence interval around the difference in means, which accounts for the uncertainty in both estimates jointly; as computed above, that interval does not contain zero, so the difference between the means is statistically significant.
The dataset contains many variables on the individuals interviewed. Choose six interesting variables and construct a correlation matrix with them. In three or four sentences, interpret what you see in the table. Do you see anything surprising? Anything expected? If you see statistically significant correlations, discuss what this means.
# load("/Users/abbeycho/Downloads/cces18_common_vv.RData")
# View(x)
# summary(x)
# selected_vars <- c("commonweight", "tookpost", "birthyr", "gender", "educ", "race")
# subset_data <- x[, selected_vars]
# cor_matrix <- cor(subset_data)
# print(cor_matrix)
Taking the post-election survey and birth year have a moderately strong negative correlation (-0.371), while education and taking the post-election survey have a positive correlation (0.237). I expected that respondents with higher education levels would be more likely to take the post-election survey. I did not expect any relationship between the survey weight and birth year, and indeed their correlation (0.019) is close to zero, so these two variables appear essentially unrelated.
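Note that cor() alone reports no p-values; cor.test() tests the significance of an individual pair. A minimal self-contained sketch, with R's built-in mtcars data standing in for the survey variables:

round(cor(mtcars[, c("mpg", "wt", "hp")]), 2)  # correlation matrix, no p-values
cor.test(mtcars$mpg, mtcars$wt)                # r, a t statistic, and a p-value for one pair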
# x$CC18_417a_4_recode <- ifelse(x$CC18_417a_4 == 1, 1, 0)
# x$CC18_417a_4_recode
After recoding, CC18_417a_4 consists only of zeros and ones, apart from missing values.
# contingency_table <- table(x$race, x$CC18_417a_4_recode)
# chi_square <- chisq.test(contingency_table)
# print(contingency_table)
# print (chi_square)
Pearson’s Chi-squared test
X-squared = 109.68, df = 7, p-value < 2.2e-16. The p-value is far below 0.05, so we can reject the null hypothesis that race and engagement in protest are independent. Hence, there is a significant association between the two variables.
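For reference, a minimal self-contained example of running chisq.test() on a contingency table (simulated data, not the CCES variables):

set.seed(42)
group   <- sample(c("A", "B", "C"), 1000, replace = TRUE)     # simulated categorical variable
protest <- rbinom(1000, 1, ifelse(group == "A", 0.15, 0.05))  # group A protests more often
tab <- table(group, protest)   # 3 x 2 contingency table
chisq.test(tab)                # df = (3 - 1) * (2 - 1) = 2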
It is not appropriate to answer this question using a correlation coefficient because correlation coefficients measure the strength and direction of a linear relationship between two continuous variables. Here, "race" is a nominal categorical variable and "engagement in protest" is a binary (0/1) indicator, so a Pearson correlation coefficient is not meaningful.
# x$CC18_354b_1_recode <- ifelse(x$CC18_354b_1 == 1, 1, 0)
# x$CC18_354b_1_recode
# crosstabulation <- table(x$hispanic, x$CC18_354b_1_recode)
# chi_square_ah <- chisq.test(crosstabulation)
# print(crosstabulation)
# print (chi_square_ah)
# result_ah <- t.test(x$hispanic, x$CC18_354b_1_recode)
# print(result_ah)
Pearson’s Chi-squared test with Yates’ continuity correction
X-squared = 0.19076, df = 1, p-value = 0.6623
Welch Two Sample t-test
t = 933.9, df = 2718.5, p-value < 2.2e-16; alternative hypothesis: true difference in means is not equal to 0; 95 percent confidence interval: (1.954112, 1.962335); sample estimates: mean of x = 1.9663494, mean of y = 0.008126195. The two tests point in different directions: the chi-squared test (p = 0.6623) provides no evidence of an association between Hispanic identity and this outcome, while the Welch t-test reports an extremely small p-value. The t-test result, however, largely reflects the fact that the two variables are coded on different scales (hispanic is coded 1/2, with a mean of about 1.97, while the recoded outcome is 0/1, with a mean of about 0.008), so the enormous t statistic does not describe a substantively meaningful group difference. This illustrates the contrast between statistically significant and substantively meaningful differences: statistical significance alone provides no information about the magnitude or practical relevance of an observed difference.
Judging substantive meaningfulness requires looking at the effect size, that is, the magnitude of the difference between the group means on a common and interpretable scale, together with the context, domain-specific knowledge, and the practical implications of the observed difference.
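A simulation sketch of this point (hypothetical data, not the survey): with a very large sample, even a negligible true difference produces a tiny p-value, so the p-value alone says little about practical importance.

set.seed(7)
n  <- 100000
g1 <- rnorm(n, mean = 0.00, sd = 1)
g2 <- rnorm(n, mean = 0.02, sd = 1)   # true difference of 0.02 SD: trivial in practice
out <- t.test(g1, g2)
out$p.value                           # typically far below 0.05 despite the tiny effect
diff(out$estimate)                    # the estimated difference itself is only about 0.02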
#x$internethome_recode <- ifelse(x$internethome == 1, 1,
# ifelse(x$internethome == 2, 1, 0))
#x$healthins_7_recode <- ifelse(x$healthins_7 == 1, 1,
# ifelse(x$healthins_7 == 2, 0, 0))
#x$CC18_303_2_recode <- ifelse(x$CC18_303_2 == 1, 1,
# ifelse(x$CC18_303_2 == 2, 0, 0))
#x$CC18_303_9_recode <- ifelse(x$CC18_303_9 == 1, 1,
# ifelse(x$CC18_303_9 == 2, 0, 0))
#x$index <- rowSums(x[,c("internethome_recode", "healthins_7_recode", "CC18_303_2_recode", "CC18_303_9_recode")])
#summary(x$index)
#table(x$index)
#x$income <- ifelse(x$faminc_new == 1, 5000,
# ifelse(x$faminc_new == 2, 15000,
# ifelse(x$faminc_new == 3, 25000,
# ifelse(x$faminc_new == 4, 35000,
# ifelse(x$faminc_new == 5, 45000,
# ifelse(x$faminc_new == 6, 55000,
# ifelse(x$faminc_new == 7, 65000,
# ifelse(x$faminc_new == 8, 75000,
# ifelse(x$faminc_new == 9, 90000,
# ifelse(x$faminc_new == 10, 110000,
# ifelse(x$faminc_new == 11, 130000,
# ifelse(x$faminc_new == 12, 175000,
# ifelse(x$faminc_new == 13, 225000,
# ifelse(x$faminc_new == 14, 300000,
# ifelse(x$faminc_new == 15, 425000,
# ifelse(x$faminc_new == 16, 600000, NA))))))))))))))))
#print(x$income)
#crosstab_ii <- table(x$index, x$income)
#chi_square_ii <- chisq.test(crosstab_ii)
#print(crosstab_ii)
#print (chi_square_ii)
Pearson’s Chi-squared test
data: crosstab_ii X-squared = 2576.1, df = 60, p-value < 2.2e-16
Chi-square statistic (X-squared): the calculated chi-square value is 2576.1. Degrees of freedom (df): for a contingency table, df = (number of rows - 1) × (number of columns - 1); here the index takes 5 values (0 through 4) and income has 16 categories, so df = (5 - 1) × (16 - 1) = 60. P-value: the p-value is reported as < 2.2e-16, indicating strong evidence against the null hypothesis of no association between income and the hardship index.
Based on these results, we can conclude that there is a statistically significant relationship between income and the hardship index. However, further analysis is needed to determine the nature and direction of this relationship (e.g., association or causation).
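One common follow-up, sketched here on a toy table (hypothetical counts, not the CCES data), is to inspect the Pearson residuals from chisq.test(); cells with large positive residuals contain more observations than independence would predict, which reveals the direction of the association.

toy_tab <- matrix(c(80, 20,
                    50, 50,
                    30, 70),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(income = c("low", "mid", "high"),
                                  hardship = c("yes", "no")))
res <- chisq.test(toy_tab)
res$residuals   # e.g., low income / hardship "yes" appears more often than expected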
#correlation <- cor.test(x$income, x$index)
# print(correlation)
#cor_coef <- correlation$estimate
#p_value <- correlation$p.value
#print(cor_coef)
#print(p_value)
Pearson’s product-moment correlation
data: x$income and x$index; t = -15.509, df = 53551, p-value < 2.2e-16; alternative hypothesis: true correlation is not equal to 0; 95 percent confidence interval: (-0.07529478, -0.05843151); sample estimates: cor = -0.06686792. Correlation coefficient: -0.06686792; p-value: 3.976005e-54. The correlation coefficient (sample estimate) is -0.067, indicating a weak negative linear relationship between income and the index; since it is close to zero, the relationship is very weak.
The p-value is reported as < 2.2e-16, which is essentially zero. This extremely small p-value indicates that the observed correlation is statistically significant.
The test statistic (t) is -15.509, and the degrees of freedom (df) are 53551. The 95 percent confidence interval for the true correlation coefficient is calculated as -0.07529478 to -0.05843151. This interval suggests that we can be 95 percent confident that the true correlation between income and index falls within this range.
Based on these results, we can conclude that there is a statistically significant but weak negative correlation between income and index. This means that as income increases, there tends to be a slight decrease in the index of hardships. However, note that the correlation is weak, and other factors may have a more substantial influence on the index of hardships.
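To put "weak" in concrete terms, squaring the correlation gives the share of variance in the index associated with income; a quick arithmetic sketch:

r <- -0.06686792   # correlation reported by cor.test() above
r^2                # about 0.0045: income accounts for under half a percent of the variance in the index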