Overview: Using methods of statistical inference, namely the chi-squared test, ANOVA, hypothesis testing, and confidence intervals, I investigate how race, political views, and economic events affect the financial perceptions of the American public.
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
## ── Attaching packages ────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1 ✔ purrr 0.2.5
## ✔ tibble 1.4.2 ✔ dplyr 0.7.5
## ✔ tidyr 0.8.1 ✔ stringr 1.3.1
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## ── Conflicts ───────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(stringr)
library(statsr)
load("gss.RData")
Brief: Since 1972, the General Social Survey (GSS) has provided politicians, policymakers, and scholars with a clear and unbiased perspective on what Americans think and feel about such issues as national spending priorities, crime and punishment, intergroup relations, and confidence in institutions.
The GSS gathers data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes. Hundreds of trends have been tracked since 1972.
Sampling Design: The sampling design takes three distinct forms:
Modified Probability Sample before 1975: Block Quota/Stratified Sampling was deployed. Although cheaper and more convenient, it generates sample biases, mainly due to not-at-homes, which are not controlled by the quotas. To reduce the bias, interviews were conducted only on weekends and after 3:00 PM on weekdays.
Transitional Sample Design (one-half full probability and one-half block quota) in 1975 & 1976.
Full Probability Sample from 1977 onwards.
Note: I have included data for the years before 1976 because of the measures taken to tackle sampling biases and the credibility of the organisation.
The GSS did not perform experiments on randomly sampled subjects to derive data. Hence, while there was random sampling, there was no random assignment. The selected individuals were not divided into control and treatment groups to be treated differently (which is the crux of a random assignment).
Therefore, the results can be generalised to the whole population, but no causal conclusions can be drawn.
Data Collection: Data was collected through face-to-face interviews as far as possible. After 2002, computer-assisted personal interview (CAPI) was adopted. In certain cases when personal interviews were not possible, telephone interviews were conducted.
Citation: Smith, Tom W., Michael Hout, and Peter V. Marsden. General Social Survey, 1972-2012 [Cumulative File]. ICPSR34802-v1. Storrs, CT: Roper Center for Public Opinion Research, University of Connecticut /Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributors], 2013-09-11.
Persistent URL: http://doi.org/10.3886/ICPSR34802.v1
I’m interested in investigating how race, political views, and economic events affect the financial perceptions of the American public. Based on that, I’ve formulated 3 research objectives.
Objective I: It is generally accepted that Black people, on average, have weaker financial backgrounds than their White counterparts due to racial history. I’m interested to find out whether race affects the personal financial satisfaction levels of the public.
Objective II: I want to explore the confidence levels in American banking and financial institutions and how they have varied over the years. In particular, I want to compare the proportion of respondents with positive confidence in financial institutions in 2008 to that in 2012. It is common knowledge that there was a global economic crisis in 2008.
I want to try linking these levels to the corresponding economic health of the country and analyse whether they agree.
Objective III: Americans tend to be nationalistic and political. Is there a difference in the income levels of Americans affiliated with different political parties & views, and is that difference statistically significant?
In this part, I will explore variables of interest in order to identify trends and produce relevant graphical & numerical summaries.
Brief: Does race affect the personal financial satisfaction levels of the public?
Variables of interest:
satfin: Records whether the respondent is personally satisfied with their financial situation. (satisfied/more or less/not at all)
race: Records the race of the respondent. (white/black/other)
#creating a data frame to store the proportions of levels in satfin
fin_sat <- gss %>%
  filter(!is.na(satfin)) %>%
  group_by(race) %>%
  summarise(satisfied = sum(satfin == "Satisfied")/n(),
            somewhat_satisfied = sum(satfin == "More Or Less")/n(),
            not_satisfied = sum(satfin == "Not At All Sat")/n())
#creating a tidy data frame for plotting
fin_sat_tidy <- gather(fin_sat, key = satisfaction_level, value = value, -race)
#plotting the data
ggplot(fin_sat_tidy, aes(race, value, fill = satisfaction_level)) + geom_col() + labs(x="Race", y="Financial Satisfaction Level")
As can be observed, we have 2 categorical variables of interest, race and satfin, each of which contains 3 factor levels.
Looking at the graph, it appears that, proportionally, Black respondents have the largest share of "not at all satisfied" answers, while White respondents have the largest share of "satisfied" answers.
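The visual impression can be cross-checked numerically. The following is a minimal sketch, assuming the race-by-satfin counts printed in the contingency table later in this report; the object names `obs` and `row_props` are illustrative and not part of the original analysis.

```r
# Observed counts copied by hand from the table(race, satfin) output in this
# report; `obs` is a hypothetical name used only for this sketch.
obs <- matrix(
  c(13455, 19095, 10271,
     1362,  2918,  2960,
      527,  1163,   703),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    race   = c("White", "Black", "Other"),
    satfin = c("Satisfied", "More Or Less", "Not At All Sat")
  )
)

# Share of each satisfaction level within each race (every row sums to 1)
row_props <- prop.table(obs, margin = 1)
round(row_props, 3)
```

With these counts, roughly 41% of Black respondents answer "Not At All Sat" against roughly 24% of White respondents, while about 31% of White respondents answer "Satisfied" against about 19% of Black respondents.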
We need to perform relevant statistical inferences to test the above observations.
Step I: State Hypotheses at \(\alpha\) = 0.05
H0: Race and financial satisfaction levels are independent of each other.
HA: Race and financial satisfaction levels are dependent on each other.
Step II: Conditions for \(\chi^2\) test
This test has 2 conditions as follows:
Independence: Since random sampling was used (as discussed in Part 1), we can conclude independence.
Sample Size: According to this condition, each particular scenario must have at least 5 expected cases.
#creating relevant data frame
provisional<-select(gss, race, satfin)
#filtering out NA values
provisional<-filter(provisional, !is.na(satfin))
table(provisional$race, provisional$satfin)
##
##         Satisfied More Or Less Not At All Sat
##   White     13455        19095          10271
##   Black      1362         2918           2960
##   Other       527         1163            703
Every cell has far more than 5 observed cases, and the expected counts reported by inference() in Step IV are likewise well above 5, so the condition is met.
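Strictly, the sample-size condition concerns expected rather than observed counts. A minimal sketch of that check, assuming the observed table above and using illustrative object names (`obs`, `expected`):

```r
# Observed counts copied from the table above (hypothetical object name)
obs <- matrix(
  c(13455, 19095, 10271,
     1362,  2918,  2960,
      527,  1163,   703),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("White", "Black", "Other"),
                  c("Satisfied", "More Or Less", "Not At All Sat"))
)

# Expected count per cell under independence:
# (row total * column total) / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
round(expected, 1)

# The condition holds if every expected count is at least 5
all(expected >= 5)
```

The smallest expected count here belongs to the Other / "Not At All Sat" cell, at roughly 636.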
Step III: Methodology
There are 2 categorical variables, both having 3 levels. This calls for a \(\chi^2\) test.
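The test statistic is \(\chi^2 = \sum (O - E)^2 / E\), summed over all cells of the table (O = observed count, E = expected count), with \(df = (r - 1)(c - 1)\) degrees of freedom, where r = no. of rows and c = no. of columns.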
Step IV: Inference
df = (3-1)*(3-1) #based on the above table
df
## [1] 4
#applying the chi-squared test
chisq.test(provisional$race, provisional$satfin)
##
## Pearson's Chi-squared test
##
## data: provisional$race and provisional$satfin
## X-squared = 1091.4, df = 4, p-value < 2.2e-16
The p-value is practically 0, which is far less than the significance level of 5%. Therefore, we can reject H0 in favour of HA.
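To make the calculation behind chisq.test() visible, here is a minimal hand computation, assuming the observed counts from Step II. The object names are illustrative; only base R functions are used.

```r
# Observed counts copied from the Step II table (hypothetical object name)
obs <- matrix(
  c(13455, 19095, 10271,
     1362,  2918,  2960,
      527,  1163,   703),
  nrow = 3, byrow = TRUE
)

# Expected counts under independence
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Chi-squared statistic: sum over all cells of (O - E)^2 / E
chi_sq <- sum((obs - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(obs) - 1) * (ncol(obs) - 1)

# Upper-tail p-value from the chi-squared distribution
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

c(chi_sq = chi_sq, df = df, p_value = p_value)
```

The statistic should agree with the X-squared value of about 1091.4 on 4 degrees of freedom reported above.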
We can also conduct a hypothesis test using the inference() function.
inference(x = race, y = satfin, data = provisional, type = "ht", statistic = "proportion", method = "theoretical", sig_level = 0.05, success = "Satisfied", alternative = "greater" )
## Response variable: categorical (3 levels)
## Explanatory variable: categorical (3 levels)
## Observed:
##        y
## x       Satisfied More Or Less Not At All Sat
##   White     13455        19095          10271
##   Black      1362         2918           2960
##   Other       527         1163            703
##
## Expected:
##        y
## x        Satisfied More Or Less Not At All Sat
##   White 12526.1262    18919.806      11375.068
##   Black  2117.8663     3198.884       1923.250
##   Other   700.0075     1057.311        635.682
##
## H0: race and satfin are independent
## HA: race and satfin are dependent
## chi_sq = 1091.4207, df = 4, p_value = 0
This result agrees with our previous chisq.test() result.
Step V: Interpretation
Because the reported p-value is effectively 0, far less than the \(\alpha\)-value, we can conclude that race and financial satisfaction are indeed dependent on each other. Note that the test establishes that an association exists, not how strong it is. This conclusion is in line with our preliminary visual interpretation.
Brief: Compare the proportion of positive confidence level in financial institutions in 2008 to that in 2012.
Variables of Interest:
year: year in which the survey was performed.
confinan: confidence level in financial institutions and banks.
#creating relevant data frame
financial_confidence<-select(gss, year, confinan)
#filtering out NA values
financial_confidence<-filter(financial_confidence, !is.na(confinan), !is.na(year))
#plotting confidence levels by year
ggplot(financial_confidence, aes(year, fill = confinan)) + geom_bar() + labs(x="Year", y="Confidence Levels")
Negative confidence levels (i.e. "Hardly Any") have been on the rise since 1972, as can be seen from the blue-coloured bars in the above graph.
Next, I’ll plot the proportion of positive confidence levels in financial institutions (i.e. "A Great Deal") per year.
#data frame to store proportional value of positive confidence
proportional_financial_confidence<-financial_confidence%>%group_by(year)%>%summarise(confident = sum(confinan == "A Great Deal")/n())
#plotting proportion of positive confidence by year
ggplot(proportional_financial_confidence, aes(year, confident)) + geom_col() + labs(x="Year", y="Proportional Confidence Levels")
The plot, in general, follows the pattern of the US economy. For example, confidence in financial institutions drops around 1990, in line with the US recession of that period.
Conversely, the confidence levels rise from 1993 to 2000, when the USA experienced an economic boom: an extended period of economic prosperity during which GDP increased continuously for almost ten years.
Visually, the proportion of positive confidence in 2008 is actually greater than that in 2012. This is inconsistent with our expectations (we expected positive confidence levels to plummet in 2008 due to the economic crisis).
Perhaps there is a confounding variable at play here, but we cannot determine the cause of the inconsistency from the given data.
I want to determine whether the difference in proportions is statistically significant, or whether it is simply due to sampling variability. This calls for another statistical inference.
For the purpose of this inference, I will assume "A Great Deal" as success, and "Only Some" & "Hardly Any" as failures.
#investigating years 2008 & 2012
#collapsing the "Hardly Any" and "Only Some" levels into a single "failure" level
fincon_backup <- financial_confidence %>%
  filter(year == 2008 | year == 2012) %>%
  mutate(confinan = as.factor(if_else(confinan == "A Great Deal", "A Great Deal", "failure")))
#proportion of success ("A Great Deal") in each year
fincon_backup_summary <- fincon_backup %>%
  group_by(year) %>%
  summarise(prop_success = sum(confinan == "A Great Deal")/n())
#plotting proportion of positive confidence in 2008 & 2012
ggplot(fincon_backup_summary, aes(year, prop_success)) + geom_col() + labs(x="Year", y="Proportion of Success")
#converting year to a factor so that inference() treats it as a categorical explanatory variable
fincon_backup$year<-as.factor(fincon_backup$year)
Step I: State Hypotheses at \(\alpha\) = 0.05
H0: \(p_{2008} = p_{2012}\) (there is no difference in the proportion of success in 2008 and 2012)
HA: \(p_{2008} > p_{2012}\) (2008 confidence levels are greater than those of 2012)
Step II: Conditions for Central Limit Theorem
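The conditions are as follows: independence (random sampling, with each sample smaller than 10% of its population) and sample size (the success-failure condition: at least 10 successes and 10 failures in each sample).
fincon_backup%>%group_by(year)%>%summarise(n())
## # A tibble: 2 x 2
##   year  `n()`
##   <fct> <int>
## 1 2008   1350
## 2 2012   1325
fincon_backup_summary
## # A tibble: 2 x 2
##    year prop_success
##   <int>        <dbl>
## 1  2008        0.195
## 2  2012        0.112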
\(n_{2008}\) = 1350
\(n_{2012}\) = 1325
Each sample is far smaller than 10% of the population of the United States, so the independence condition is met.
Successes: \(n_{2008}\hat{p}_{2008}\) = 1350 × 0.195 ≈ 263 and \(n_{2012}\hat{p}_{2012}\) = 1325 × 0.112 ≈ 148.
Failures: \(n_{2008}(1 - \hat{p}_{2008})\) = 1350 × 0.805 ≈ 1087 and \(n_{2012}(1 - \hat{p}_{2012})\) = 1325 × 0.888 ≈ 1177.
All four counts are well above 10, so the success-failure condition is met in both samples.
Thus, we can use the Central Limit Theorem.
Step III: Methodology
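There are 2 categorical variables, both being binary. Therefore, we use the z-statistic to calculate the p-value:
\(z = \dfrac{(\hat{p}_{2008} - \hat{p}_{2012}) - 0}{SE}\), where \(SE = \sqrt{\hat{p}_{pool}(1 - \hat{p}_{pool})\left(\frac{1}{n_{2008}} + \frac{1}{n_{2012}}\right)}\) and \(\hat{p}_{pool}\) is the pooled proportion of successes across both years.
The Confidence Interval for the difference in 2 population proportions (\(z^*\) = 1.96 for 95% Confidence Level) is
\((\hat{p}_{2008} - \hat{p}_{2012}) \pm ME\)
where Margin of Error \(ME = z^* \times SE\) and, for the interval, \(SE = \sqrt{\frac{\hat{p}_{2008}(1 - \hat{p}_{2008})}{n_{2008}} + \frac{\hat{p}_{2012}(1 - \hat{p}_{2012})}{n_{2012}}}\).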
Step IV: Inference
inference(y = confinan, x = year, data = fincon_backup, statistic = "proportion", type = "ht", null = 0, alternative = "greater", method = "theoretical", success = "A Great Deal")
## Response variable: categorical (2 levels, success: A Great Deal)
## Explanatory variable: categorical (2 levels)
## n_2008 = 1350, p_hat_2008 = 0.1948
## n_2012 = 1325, p_hat_2012 = 0.1125
## H0: p_2008 = p_2012
## HA: p_2008 > p_2012
## z = 5.9003
## p_value = < 0.0001
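The z-statistic above can be reproduced by hand. This is a minimal sketch under one stated assumption: the success counts are not printed in the report, so they are reconstructed here by rounding n × p_hat. All object names are illustrative.

```r
# Sample sizes as reported; success counts are reconstructed from the
# rounded proportions (an assumption, the raw counts are not printed).
n_2008 <- 1350
n_2012 <- 1325
x_2008 <- round(n_2008 * 0.1948)
x_2012 <- round(n_2012 * 0.1125)

p_2008 <- x_2008 / n_2008
p_2012 <- x_2012 / n_2012

# Pooled proportion, used because H0 states p_2008 = p_2012
p_pool <- (x_2008 + x_2012) / (n_2008 + n_2012)

# Success-failure check with the pooled proportion (all should be >= 10)
sf_counts <- c(n_2008 * p_pool, n_2008 * (1 - p_pool),
               n_2012 * p_pool, n_2012 * (1 - p_pool))

# Standard error under H0 and the resulting z-statistic
se_pool <- sqrt(p_pool * (1 - p_pool) * (1 / n_2008 + 1 / n_2012))
z <- (p_2008 - p_2012) / se_pool

# One-sided p-value for HA: p_2008 > p_2012
p_value <- pnorm(z, lower.tail = FALSE)

c(z = z, p_value = p_value)
```

Under these assumptions the result should land on the z of about 5.90 reported by inference().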
Step V: Interpretation
The p-value is less than the \(\alpha\) = 0.05 significance level, which means we can reject H0 in favour of HA.
The probability of obtaining a difference in sample proportions of 8.23 percentage points or greater, if in fact there were no difference in the population proportions, is less than 0.0001.
Thus, there is convincing evidence in favour of HA, and the difference between the confidence levels is indeed statistically significant.
As a cross-check, let’s calculate the 95% Confidence Interval for the same data:
inference(y = confinan, x = year, data = fincon_backup, statistic = "proportion", type = "ci", method = "theoretical", success = "A Great Deal")
## Response variable: categorical (2 levels, success: A Great Deal)
## Explanatory variable: categorical (2 levels)
## n_2008 = 1350, p_hat_2008 = 0.1948
## n_2012 = 1325, p_hat_2012 = 0.1125
## 95% CI (2008 - 2012): (0.0552 , 0.1095)
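The interval can also be rebuilt from the reported sample sizes and proportions. As before, this is a sketch that assumes success counts reconstructed by rounding n × p_hat; the object names are illustrative.

```r
# Reconstructed counts (assumption: rounded from the reported proportions)
n_2008 <- 1350
n_2012 <- 1325
p_2008 <- round(n_2008 * 0.1948) / n_2008
p_2012 <- round(n_2012 * 0.1125) / n_2012

# Unpooled standard error, as used for a confidence interval
se <- sqrt(p_2008 * (1 - p_2008) / n_2008 + p_2012 * (1 - p_2012) / n_2012)

# Critical value for a 95% confidence level (about 1.96)
z_star <- qnorm(0.975)

# Point estimate plus/minus the margin of error
ci <- (p_2008 - p_2012) + c(-1, 1) * z_star * se
round(ci, 4)
```

The whole interval lies above 0, which is consistent with rejecting H0 in the hypothesis test: we are 95% confident that the proportion with "A Great Deal" of confidence was roughly 5.5 to 11 percentage points higher in 2008 than in 2012.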
Brief: Is there a difference in the income levels of Americans affiliated with different political parties & views, and is that difference statistically significant?
Variables of Interest:
partyid: Respondent’s political party affiliation.
coninc: Respondent’s inflation-adjusted family income, in dollars.
income_by_party<-gss%>%filter(!is.na(partyid), !is.na(coninc))%>%select(partyid, coninc)
hist(income_by_party$coninc, xlab = "Inflation-adjusted Family Income", main = "Family Income in Constant Dollars")
The income distribution is heavily right-skewed. Ideally, I would use the median as the measure of central tendency, but because I want to investigate only the difference in income levels and not the actual income levels, I will use the sample mean so that I can apply an ANOVA test.
Moving on to Statistical Inference:
Step I: State Hypotheses at \(\alpha\) = 0.05
H0: The mean is the same across all the categories. Since we have 8 levels in the partyid variable:
\(\mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5 = \mu_6 = \mu_7 = \mu_8\)
HA: At least one pair of means is not equal.
Step II: Conditions for ANOVA
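Independence: As discussed in Part 1, respondents were randomly sampled, and each respondent falls into exactly one partyid group. Thus, we can conclude that the samples are independent.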
income_by_party%>%group_by(partyid)%>%summarise(n())
## # A tibble: 8 x 2
##   partyid            `n()`
##   <fct>              <int>
## 1 Strong Democrat     8232
## 2 Not Str Democrat   10995
## 3 Ind,Near Dem        6201
## 4 Independent         7223
## 5 Ind,Near Rep        4525
## 6 Not Str Republican  8166
## 7 Strong Republican   4940
## 8 Other Party          775
Constant variance: This condition demands that the groups have roughly equal variability.
As we can see from the following side-by-side boxplots, all of the groups have similar variability, except for the "Strong Republican" group, whose variability is slightly greater. So we can assume this condition to be met.
ggplot(income_by_party, aes(partyid, coninc)) + geom_boxplot() + theme(axis.text.x=element_text(angle=40, hjust=1)) + labs(x="Political Party Affiliation", y="Inflation-adjusted Family Income")
Step III: Methodology
Since we are comparing the means of more than two groups, we use ANOVA and its F-statistic, the ratio of the between-group to the within-group mean square.
Step IV: Inference
inference(y = coninc, x = partyid, data = income_by_party, statistic = "mean", type = "ht", method = "theoretical", alternative = "greater")
## Response variable: numerical
## Explanatory variable: categorical (8 levels)
## n_Strong Democrat = 8232, y_bar_Strong Democrat = 37970.1833, s_Strong Democrat = 33195.5316
## n_Not Str Democrat = 10995, y_bar_Not Str Democrat = 41936.0819, s_Not Str Democrat = 33045.6652
## n_Ind,Near Dem = 6201, y_bar_Ind,Near Dem = 42827.115, s_Ind,Near Dem = 34179.2018
## n_Independent = 7223, y_bar_Independent = 38974.2352, s_Independent = 33341.2145
## n_Ind,Near Rep = 4525, y_bar_Ind,Near Rep = 49183.8608, s_Ind,Near Rep = 37136.5051
## n_Not Str Republican = 8166, y_bar_Not Str Republican = 51707.0563, s_Not Str Republican = 38281.6394
## n_Strong Republican = 4940, y_bar_Strong Republican = 55154.2346, s_Strong Republican = 41427.8594
## n_Other Party = 775, y_bar_Other Party = 45588.52, s_Other Party = 38599.9933
##
## ANOVA:
##              df           Sum_Sq          Mean_Sq        F  p_value
## partyid       7 1746245054827.48 249463579261.069 198.4193 < 0.0001
## Residuals 51049 64181598251474.8  1257254760.1613
## Total     51056 65927843306302.3
##
## Pairwise tests - t tests with pooled SD:
## group1 group2 p.value
## 1 Not Str Democrat Strong Democrat 1.696e-14
## 2 Ind,Near Dem Strong Democrat 3.839e-16
## 3 Independent Strong Democrat 7.903e-02
## 4 Ind,Near Rep Strong Democrat 2.711e-65
## 5 Not Str Republican Strong Democrat 5.019e-135
## 6 Strong Republican Strong Democrat 1.333e-158
## 7 Other Party Strong Democrat 1.082e-08
## 8 Ind,Near Dem Not Str Democrat 1.136e-01
## 9 Independent Not Str Democrat 3.502e-08
## 10 Ind,Near Rep Not Str Democrat 6.157e-31
## 11 Not Str Republican Not Str Democrat 4.244e-79
## 12 Strong Republican Not Str Democrat 1.515e-104
## 13 Other Party Not Str Democrat 5.580e-03
## 14 Independent Ind,Near Dem 3.489e-10
## 15 Ind,Near Rep Ind,Near Dem 4.925e-20
## 16 Not Str Republican Ind,Near Dem 6.771e-50
## 17 Strong Republican Ind,Near Dem 5.146e-74
## 18 Other Party Ind,Near Dem 4.095e-02
## 19 Ind,Near Rep Independent 5.567e-52
## 20 Not Str Republican Independent 5.603e-109
## 21 Strong Republican Independent 4.497e-134
## 22 Other Party Independent 8.039e-07
## 23 Not Str Republican Ind,Near Rep 1.233e-04
## 24 Strong Republican Ind,Near Rep 2.836e-16
## 25 Other Party Ind,Near Rep 9.103e-03
## 26 Strong Republican Not Str Republican 6.934e-08
## 27 Other Party Not Str Republican 4.424e-06
## 28 Other Party Strong Republican 2.935e-12
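To show how the entries of the ANOVA table fit together, the F-statistic can be rebuilt from the sums of squares and group sizes printed above. This is a sketch with illustrative object names, using only numbers copied from the output.

```r
# Sums of squares copied from the ANOVA table above
ss_group <- 1746245054827.48   # between-group (partyid)
ss_resid <- 64181598251474.8   # within-group (Residuals)

# Group sizes copied from the summary output above
n_groups <- c(8232, 10995, 6201, 7223, 4525, 8166, 4940, 775)

# Degrees of freedom: k - 1 between groups, n - k within groups
df_group <- length(n_groups) - 1
df_resid <- sum(n_groups) - length(n_groups)

# Mean squares and the F-statistic (ratio of the two mean squares)
ms_group <- ss_group / df_group
ms_resid <- ss_resid / df_resid
f_stat   <- ms_group / ms_resid

# Upper-tail p-value from the F distribution
p_value <- pf(f_stat, df1 = df_group, df2 = df_resid, lower.tail = FALSE)

c(F = f_stat, p_value = p_value)
```

The result should match the F of about 198.42 on 7 and 51049 degrees of freedom in the table.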
Step V: Interpretation
Because of the small p-value (< \(\alpha\) = 0.05), we reject the null hypothesis in favour of HA.
Thus, there is convincing evidence that at least one pair of group means differs. The ANOVA F-test by itself does not identify which pairs differ; for that, the pairwise t-tests printed above must be judged against a significance level corrected for the 28 comparisons.
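One common way to read the pairwise output is a Bonferroni correction. The sketch below assumes that the p-values printed in the pairwise table are unadjusted (the output does not say so explicitly), copies them in row order, and uses illustrative object names.

```r
# Number of pairwise comparisons among 8 groups
k <- 8
n_comparisons <- choose(k, 2)

# Bonferroni-corrected significance level
alpha_star <- 0.05 / n_comparisons

# p-values copied from rows 1 to 28 of the pairwise table above
# (assumed to be unadjusted)
p_values <- c(
  1.696e-14, 3.839e-16, 7.903e-02, 2.711e-65, 5.019e-135, 1.333e-158,
  1.082e-08, 1.136e-01, 3.502e-08, 6.157e-31, 4.244e-79, 1.515e-104,
  5.580e-03, 3.489e-10, 4.925e-20, 6.771e-50, 5.146e-74, 4.095e-02,
  5.567e-52, 5.603e-109, 4.497e-134, 8.039e-07, 1.233e-04, 2.836e-16,
  9.103e-03, 2.935e-12, 6.934e-08, 4.424e-06
)

# Rows whose difference is NOT significant at the corrected level
not_significant <- which(p_values >= alpha_star)
not_significant
```

Sorting the last three values differently from the table would change the row indices, so the vector keeps the table's order except that the final three rows are listed as 28, 26, 27; only row 25 and earlier rows exceed the threshold in any case. Under the stated assumption, the pairs that fail to reach the corrected level of about 0.0018 are Independent vs Strong Democrat, Ind,Near Dem vs Not Str Democrat, Other Party vs Not Str Democrat, Other Party vs Ind,Near Dem, and Other Party vs Ind,Near Rep.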
Inferences and interpretations have already been discussed in detail in part 3 under each Research Objective. The purpose of this part is to recap the interpretations and conclude the analysis of the GSS Data.
Interpretation for Objective I
Interpretation for Objective II
Interpretation for Objective III
We tested for a difference in at least one pair of means of the income levels of groups of people with various political party affiliations. Because of the small p-value (< \(\alpha\) = 0.05), we rejected the null hypothesis in favour of HA.
However, the ANOVA test alone does not tell us which pairs of means differ.