install.packages("pagedown")

output:
  pdf_document:
    pandoc_args: --pdf-engine=pagedown::chrome_print
Because the OLS residuals have mean zero (\({\overline{\widehat{u}} = 0}\)) and are orthogonal to the fitted values (\({\sum\nolimits_{i}\widehat{u}_{i}\widehat{y}_{i} = 0}\)):

\({cov(\widehat{y},\widehat{u}) \equiv \frac{\sum\nolimits_{i}(\widehat{y}_{i}-\overline{\widehat{y}})(\widehat{u}_{i}-\overline{\widehat{u}})}{n-1}}\)

\({= \frac{\sum\nolimits_{i}(\widehat{u}_{i}-0)(\widehat{y}_{i}-\overline{\widehat{y}})}{n-1}}\)

\({= \frac{\sum\nolimits_{i}\widehat{u}_{i}\widehat{y}_{i}}{n-1}-\frac{\overline{\widehat{y}}\sum\nolimits_{i}\widehat{u}_{i}}{n-1}}\)

\({= \frac{0}{n-1}-\frac{\overline{\widehat{y}}(0)}{n-1}=0.}\)
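As a quick numerical check of this result (a minimal sketch using R's built-in mtcars data, purely for illustration), fitting any OLS regression and taking the sample covariance of the fitted values and residuals returns zero up to floating-point error:

fit <- lm(mpg ~ wt, data = mtcars)  # any simple OLS fit will do
y_hat <- fitted(fit)                # fitted values
u_hat <- resid(fit)                 # residuals (they sum to zero)
cov(y_hat, u_hat)                   # essentially 0 (on the order of 1e-16)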
2. The statement is FALSE. Chi-square tests of independence are actually adjusted for sample size N through the degrees of freedom (df) in the chi-square distribution. In a chi-square test of independence, the null hypothesis states that there is no association between two categorical variables. The alternative hypothesis suggests that there is an association between the variables. The chi-square test statistic is calculated by comparing the observed frequencies in each category with the expected frequencies under the assumption of independence. As the sample size N grows larger, the expected frequencies in each category tend to be larger as well. This results in a larger chi-square test statistic. However, the critical values of the chi-square distribution also increase as the degrees of freedom increase. Therefore, the adjustment for sample size through the degrees of freedom helps maintain the appropriate significance level for rejecting the null hypothesis.
The statement is FALSE. t-tests of the differences in group means are adjusted for sample size N through both the test statistic and the degrees of freedom. As the sample size N grows, the standard error of the difference (SE) decreases because the variances of the sample means shrink, so a given observed difference produces a larger t statistic. At the same time, the degrees of freedom increase, and the critical values of the t-distribution shrink toward those of the Normal distribution. Therefore, t-tests incorporate sample size N through both the calculation of the test statistic and the degrees of freedom.
The statement is TRUE. The t-distribution approximates the Normal distribution as the sample size (N) becomes large.
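A quick numerical illustration of this convergence (a sketch, not part of the original answer): the 97.5th-percentile critical value of the t-distribution shrinks toward the Normal value of about 1.96 as the degrees of freedom grow.

qt(0.975, df = c(5, 30, 100, 1000))  # 2.57, 2.04, 1.98, 1.96
qnorm(0.975)                         # 1.96, the Normal benchmark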
The statement is FALSE. A t-test of the difference between the means of two groups will not have the same likelihood of rejecting the null hypothesis in the two scenarios described. The degrees of freedom in a t-test are determined by the sample sizes of the two groups; for an independent two-sample t-test, \({df = (N_1 + N_2) - 2}\). In the first scenario, where \({N_M = 100, N_W = 100}\), the degrees of freedom are \({(100 + 100) - 2 = 198}\); in the second scenario, where \({N_M = 10, N_W = 190}\), they are also \({(10 + 190) - 2 = 198}\). However, the likelihood of rejecting the null hypothesis is not determined by the degrees of freedom alone; it also depends on the standard error of the difference, the magnitude of the observed difference, and the chosen significance level. With equal variances, the standard error of the difference is proportional to \({\sqrt{1/N_1 + 1/N_2}}\), which is considerably smaller under the balanced allocation, so the balanced design in the first scenario yields a higher likelihood of rejecting the null hypothesis than the highly imbalanced design in the second (see the sketch below).
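The sketch assumes, purely for illustration, a common within-group standard deviation of 10:

s <- 10                                   # assumed common within-group SD (illustrative)
se_balanced   <- s * sqrt(1/100 + 1/100)  # N_M = 100, N_W = 100
se_imbalanced <- s * sqrt(1/10 + 1/190)   # N_M = 10,  N_W = 190
c(balanced = se_balanced, imbalanced = se_imbalanced)
# about 1.41 vs 3.24: the imbalanced design has a much larger standard error,
# so the same true difference yields a smaller t statistic and lower power.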
\[\overline{X}=48.5, s_X=10, N_X=100\;\text{ and }\;\overline{Y}=51.5, s_Y=10, N_Y=100.\]
You are using \(\overline{X}\) and \(\overline{Y}\) as estimates of the population parameters \(\mu_X\) and \(\mu_Y\) with the confidence intervals shown in the figure below.
For group X: \({CI_X = 48.5 \pm t \times \frac{s_X}{\sqrt{N_X}}}\). Since \({N_X = 100}\), the standard error is \({SE_X = \frac{s_X}{\sqrt{N_X}} = \frac{10}{\sqrt{100}} = 1}\). For a 95% confidence level, the critical t value is approximately 1.97 with 198 degrees of freedom (very close to the Normal value of 1.96). Thus, the confidence interval for \(\mu_X\) becomes: \({CI_X = 48.5 \pm 1.97 \times 1}\).
For group Y, following the same logic, the standard error \(SE_Y\) is also 1. Thus, the confidence interval for \(\mu_Y\) becomes: \({CI_Y = 51.5 \pm 1.97 \times 1}\). Therefore, the 95% confidence interval around \(\mu_X\) is (46.53, 50.47), and the 95% confidence interval around \(\mu_Y\) is (49.53, 53.47).
The confidence interval around the difference in means is not obtained by adding or subtracting the two individual intervals; it is built from the standard error of the difference: \({SE_{diff} = \sqrt{SE_X^2 + SE_Y^2} = \sqrt{1^2 + 1^2} \approx 1.41}\).
The interval is therefore \({(\overline{Y} - \overline{X}) \pm 1.97 \times SE_{diff} = (51.5 - 48.5) \pm 2.79 = 3 \pm 2.79}\), or approximately (0.21, 5.79). Since the confidence interval for the difference does not contain zero, it suggests that there is a statistically significant difference between the means of groups X and Y. This finding contradicts the claim that \({\mu_X = \mu_Y}\). Therefore, based on the confidence interval for the difference, we have evidence to reject the claim that the two population means are equal.
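A short sketch of these calculations from the reported summary statistics (using the t critical value for 198 degrees of freedom):

xbar <- 48.5; ybar <- 51.5; s <- 10; n <- 100
se <- s / sqrt(n)                            # standard error of each mean = 1
tcrit <- qt(0.975, df = 2 * n - 2)           # about 1.97
xbar + c(-1, 1) * tcrit * se                 # CI for mu_X: about (46.53, 50.47)
ybar + c(-1, 1) * tcrit * se                 # CI for mu_Y: about (49.53, 53.47)
se_diff <- sqrt(se^2 + se^2)                 # SE of the difference, about 1.41
(ybar - xbar) + c(-1, 1) * tcrit * se_diff   # CI for the difference: about (0.21, 5.79)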
The reporter is incorrect to conclude that there is no significant difference between the means of groups X and Y based solely on the overlap of the confidence intervals around the individual means. Overlapping confidence intervals for two means do not guarantee the absence of a significant difference between them. In this case, the means of X and Y are 48.5 and 51.5, respectively, and the confidence intervals around these means overlap. The key quantity, however, is the confidence interval around the difference in means, which accounts for the uncertainty in both estimates jointly; as computed above, that interval does not contain zero, so the difference between the means is statistically significant.
The dataset contains many variables on the individuals interviewed. Choose six interesting variables and construct a correlation matrix with them. In three or four sentences, interpret what you see in the table. Do you see anything surprising? Anything expected? If you see statistically significant correlations, discuss what this means.
# load("/Users/abbeycho/Downloads/cces18_common_vv.RData")
# View(x)
# summary(x)
# selected_vars <- c("commonweight", "tookpost", "birthyr", "gender", "educ", "race")
# subset_data <- x[, selected_vars]
# cor_matrix <- cor(subset_data)
# print(cor_matrix)
Taking the post-election survey and birth year have a moderately strong negative correlation (-0.371), while education and taking the post-election survey have a positive correlation (0.237). I expected that respondents with higher education levels would be more likely to take the post-election survey. I did not expect any relationship between the survey weight and birth year, and indeed their correlation (0.019) is close to zero, so these two variables appear essentially unrelated.
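Note that cor() alone reports no p-values; cor.test() tests the significance of an individual pair. A minimal self-contained sketch, with R's built-in mtcars data standing in for the survey variables:

round(cor(mtcars[, c("mpg", "wt", "hp")]), 2)  # correlation matrix, no p-values
cor.test(mtcars$mpg, mtcars$wt)                # r, a t statistic, and a p-value for one pair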
# x$CC18_417a_4_recode <- ifelse(x$CC18_417a_4 == 1, 1, 0)
# x$CC18_417a_4_recode
After recoding, CC18_417a_4 consists only of zeros and ones, apart from missing values.
# contingency_table <- table(x$race, x$CC18_417a_4_recode)
# chi_square <- chisq.test(contingency_table)
# print(contingency_table)
# print (chi_square)
Pearson’s Chi-squared test
X-squared = 109.68, df = 7, p-value < 2.2e-16. The p-value is far below 0.05, so we can reject the null hypothesis that race and engagement in protest are independent. Hence, there is a significant association between the two variables.
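For reference, a minimal self-contained example of running chisq.test() on a contingency table (simulated data, not the CCES variables):

set.seed(42)
group   <- sample(c("A", "B", "C"), 1000, replace = TRUE)     # simulated categorical variable
protest <- rbinom(1000, 1, ifelse(group == "A", 0.15, 0.05))  # group A protests more often
tab <- table(group, protest)   # 3 x 2 contingency table
chisq.test(tab)                # df = (3 - 1) * (2 - 1) = 2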
It is not appropriate to answer this question using a correlation coefficient because correlation coefficients measure the strength and direction of a linear relationship between two continuous variables. Here, "race" is a nominal categorical variable and "engagement in protest" is a binary (0/1) indicator, so a Pearson correlation coefficient is not meaningful.
# x$CC18_354b_1_recode <- ifelse(x$CC18_354b_1 == 1, 1, 0)
# x$CC18_354b_1_recode
# crosstabulation <- table(x$hispanic, x$CC18_354b_1_recode)
# chi_square_ah <- chisq.test(crosstabulation)
# print(crosstabulation)
# print (chi_square_ah)
# result_ah <- t.test(x$hispanic, x$CC18_354b_1_recode)
# print(result_ah)
Pearson’s Chi-squared test with Yates’ continuity correction
X-squared = 0.19076, df = 1, p-value = 0.6623
Welch Two Sample t-test
t = 933.9, df = 2718.5, p-value < 2.2e-16; alternative hypothesis: true difference in means is not equal to 0; 95 percent confidence interval: (1.954112, 1.962335); sample estimates: mean of x = 1.9663494, mean of y = 0.008126195. The two tests point in different directions: the chi-squared test (p = 0.6623) provides no evidence of an association between Hispanic identity and this outcome, while the Welch t-test reports an extremely small p-value. The t-test result, however, largely reflects the fact that the two variables are coded on different scales (hispanic is coded 1/2, with a mean of about 1.97, while the recoded outcome is 0/1, with a mean of about 0.008), so the enormous t statistic does not describe a substantively meaningful group difference. This illustrates the contrast between statistically significant and substantively meaningful differences: statistical significance alone provides no information about the magnitude or practical relevance of an observed difference.
Judging substantive meaningfulness requires looking at the effect size, that is, the magnitude of the difference between the group means on a common and interpretable scale, together with the context, domain-specific knowledge, and the practical implications of the observed difference.
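A simulation sketch of this point (hypothetical data, not the survey): with a very large sample, even a negligible true difference produces a tiny p-value, so the p-value alone says little about practical importance.

set.seed(7)
n  <- 100000
g1 <- rnorm(n, mean = 0.00, sd = 1)
g2 <- rnorm(n, mean = 0.02, sd = 1)   # true difference of 0.02 SD: trivial in practice
out <- t.test(g1, g2)
out$p.value                           # typically far below 0.05 despite the tiny effect
diff(out$estimate)                    # the estimated difference itself is only about 0.02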
#x$internethome_recode <- ifelse(x$internethome == 1, 1,
# ifelse(x$internethome == 2, 1, 0))
#x$healthins_7_recode <- ifelse(x$healthins_7 == 1, 1,
# ifelse(x$healthins_7 == 2, 0, 0))
#x$CC18_303_2_recode <- ifelse(x$CC18_303_2 == 1, 1,
# ifelse(x$CC18_303_2 == 2, 0, 0))
#x$CC18_303_9_recode <- ifelse(x$CC18_303_9 == 1, 1,
# ifelse(x$CC18_303_9 == 2, 0, 0))
#x$index <- rowSums(x[,c("internethome_recode", "healthins_7_recode", "CC18_303_2_recode", "CC18_303_9_recode")])
#summary(x$index)
#table(x$index)
#x$income <- ifelse(x$faminc_new == 1, 5000,
# ifelse(x$faminc_new == 2, 15000,
# ifelse(x$faminc_new == 3, 25000,
# ifelse(x$faminc_new == 4, 35000,
# ifelse(x$faminc_new == 5, 45000,
# ifelse(x$faminc_new == 6, 55000,
# ifelse(x$faminc_new == 7, 65000,
# ifelse(x$faminc_new == 8, 75000,
# ifelse(x$faminc_new == 9, 90000,
# ifelse(x$faminc_new == 10, 110000,
# ifelse(x$faminc_new == 11, 130000,
# ifelse(x$faminc_new == 12, 175000,
# ifelse(x$faminc_new == 13, 225000,
# ifelse(x$faminc_new == 14, 300000,
# ifelse(x$faminc_new == 15, 425000,
# ifelse(x$faminc_new == 16, 600000, NA))))))))))))))))
#print(x$income)
#crosstab_ii <- table(x$index, x$income)
#chi_square_ii <- chisq.test(crosstab_ii)
#print(crosstab_ii)
#print (chi_square_ii)
Pearson’s Chi-squared test
data: crosstab_ii X-squared = 2576.1, df = 60, p-value < 2.2e-16
Chi-square statistic (X-squared): the calculated chi-square value is 2576.1. Degrees of freedom (df): for a contingency table, df = (number of rows - 1) × (number of columns - 1); here the index takes 5 values (0 through 4) and income has 16 categories, so df = (5 - 1) × (16 - 1) = 60. P-value: the p-value is reported as < 2.2e-16, indicating strong evidence against the null hypothesis of no association between income and the hardship index.
Based on these results, we can conclude that there is a statistically significant relationship between income and the hardship index. However, further analysis is needed to determine the nature and direction of this relationship (e.g., association or causation).
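One common follow-up, sketched here on a toy table (hypothetical counts, not the CCES data), is to inspect the Pearson residuals from chisq.test(); cells with large positive residuals contain more observations than independence would predict, which reveals the direction of the association.

toy_tab <- matrix(c(80, 20,
                    50, 50,
                    30, 70),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(income = c("low", "mid", "high"),
                                  hardship = c("yes", "no")))
res <- chisq.test(toy_tab)
res$residuals   # e.g., low income / hardship "yes" appears more often than expected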
#correlation <- cor.test(x$income, x$index)
# print(correlation)
#cor_coef <- correlation$estimate
#p_value <- correlation$p.value
#print(cor_coef)
#print(p_value)
Pearson’s product-moment correlation
data: x$income and x$index; t = -15.509, df = 53551, p-value < 2.2e-16; alternative hypothesis: true correlation is not equal to 0; 95 percent confidence interval: (-0.07529478, -0.05843151); sample estimates: cor = -0.06686792. Correlation coefficient: -0.06686792; p-value: 3.976005e-54. The correlation coefficient (sample estimate) is -0.067, indicating a weak negative linear relationship between income and the index; since it is close to zero, the relationship is very weak.
The p-value is reported as < 2.2e-16, which is essentially zero. This extremely small p-value indicates that the observed correlation is statistically significant.
The test statistic (t) is -15.509, and the degrees of freedom (df) are 53551. The 95 percent confidence interval for the true correlation coefficient is calculated as -0.07529478 to -0.05843151. This interval suggests that we can be 95 percent confident that the true correlation between income and index falls within this range.
Based on these results, we can conclude that there is a statistically significant but weak negative correlation between income and index. This means that as income increases, there tends to be a slight decrease in the index of hardships. However, note that the correlation is weak, and other factors may have a more substantial influence on the index of hardships.
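To put "weak" in concrete terms, squaring the correlation gives the share of variance in the index associated with income; a quick arithmetic sketch:

r <- -0.06686792   # correlation reported by cor.test() above
r^2                # about 0.0045: income accounts for under half a percent of the variance in the index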