Statistical Inference of The General Social Survey Data

Overview: Using methods of statistical inference, namely the Chi-Squared test, ANOVA, hypothesis testing and confidence intervals, I investigate how race, political views, and economic events affect the financial perceptions of the American public.

Setup

knitr::opts_chunk$set(echo = TRUE)

Load Packages

library(tidyverse)

## ── Attaching packages ────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 2.2.1     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ dplyr   0.7.5
## ✔ tidyr   0.8.1     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0

## ── Conflicts ───────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(stringr)
library(statsr)

Load the GSS Data

load("gss.RData")

Part 1: Data

Brief: Since 1972, the General Social Survey (GSS) has provided politicians, policymakers, and scholars with a clear and unbiased perspective on what Americans think and feel about such issues as national spending priorities, crime and punishment, intergroup relations, and confidence in institutions.

The GSS gathers data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes. Hundreds of trends have been tracked since 1972.

Sampling Design: The sampling design takes three distinct forms:

Modified Probability Sample before 1975: Block Quota/Stratified Sampling was deployed. Although cheaper and more convenient, it generates sample biases, mainly due to not-at-homes, which are not controlled by the quotas. To reduce the bias, interviews were conducted only on weekends and after 3:00 PM on weekdays.
Transitional Sample design (one-half full probability and one-half block quota) in 1975 & 1976.

Note: I have included data for the years before 1976 due to the measures taken to tackle sampling biases and the credibility of the organisation.

Full Probability Sample after 1976: Full probability sampling is costlier and more tedious than quota sampling, but it is more representative of the population, and hence yields better analyses.

The GSS did not perform experiments on randomly sampled subjects to derive data. Hence, while there was random sampling, there was no random assignment. The selected individuals were not divided into control and treatment groups to be treated differently (which is the crux of a random assignment).

Therefore, we can conclude that the data are generalisable to the whole population, but no causality can be drawn.

Data Collection: Data was collected through face-to-face interviews as far as possible. After 2002, computer-assisted personal interview (CAPI) was adopted. In certain cases when personal interviews were not possible, telephone interviews were conducted.

Citation: Smith, Tom W., Michael Hout, and Peter V. Marsden. General Social Survey, 1972-2012 [Cumulative File]. ICPSR34802-v1. Storrs, CT: Roper Center for Public Opinion Research, University of Connecticut /Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributors], 2013-09-11.

Persistent URL: http://doi.org/10.3886/ICPSR34802.v1

Part 2: Research Objectives

I’m interested in investigating how race, political views, and economic events affect the financial perceptions of the American public. Based on that, I’ve formulated 3 research objectives.

Objective I: It is generally accepted that black people, on average, have weaker financial backgrounds than their white counterparts due to racial history. I’m interested to find out whether race affects the personal financial satisfaction levels of the public.

Objective II: I want to explore the confidence levels in American banking and financial institutions and how they have varied over the years. In particular, I want to compare the proportion of positive confidence in financial institutions in 2008 to that in 2012. It is common knowledge that there was a global economic crisis in 2008.

I want to try linking these levels to the corresponding economic health of the country and analyse whether they agree.

Objective III: Americans tend to be nationalistic and political. Is there a difference in the income levels of Americans affiliated with different political parties & views, and is that difference statistically significant?

Part 3: Exploratory Data Analysis

In this part, I will explore variables of interest in order to identify trends and produce relevant graphical & numerical summaries.

Research Objective I: Personal Financial Situation

Brief: Does race affect the personal financial satisfaction levels of the public?

Variables of interest:

satfin: Records whether the respondent is personally satisfied with their financial situation. (satisfied/more or less/not at all)
race: Records the race of the respondent. (white/black/other)

#creating a data frame to store the proportions of levels in satfin
fin_sat<-gss%>%filter(!is.na(satfin))%>%group_by(race)%>%summarise(satisfied = sum(satfin == "Satisfied")/n(), somewhat_satisfied = sum(satfin == "More Or Less")/n(), not_satisfied = sum(satfin == "Not At All Sat")/n())

#creating a tidy data frame for plotting
fin_sat_tidy<-gather(fin_sat, key = satisfaction_level, value = value, -race)

#plotting the data
ggplot(fin_sat_tidy, aes(race, value, fill = satisfaction_level)) + geom_col() + labs(x="Race", y="Financial Satisfaction Level")

As can be observed, we have 2 categorical variables of interest, race and satfin, each of which contains 3 factor levels.

Looking at the graph, it appears that proportionally, black people are the most unsatisfied with their financial situation. White respondents seem to be the most satisfied.

We need to perform relevant statistical inferences to test the above observations.

Step I: State Hypotheses at \(\alpha\) = 0.05

H₀: Race and financial satisfaction levels are independent of each other.
H_A: Race and financial satisfaction levels are dependent on each other.

Step II: Conditions for \(\chi^2\) test

This test has 2 conditions as follows:

Independence: Since random sampling was used (as discussed in Part 1), we can conclude independence.
Sample Size: According to this condition, each particular scenario must have at least 5 expected cases.

#creating relevant data frame
provisional<-select(gss, race, satfin)

#filtering out NA values
provisional<-filter(provisional, !is.na(satfin))

table(provisional$race, provisional$satfin)

##        
##         Satisfied More Or Less Not At All Sat
##   White     13455        19095          10271
##   Black      1362         2918           2960
##   Other       527         1163            703

Every cell has far more than 5 observed cases, and the expected counts reported by inference() in Step IV are also all well above 5, so the condition is met.
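
Because the sample-size condition is stated in terms of expected counts, it can also be checked directly from the observed table. The following is a minimal sketch in base R; the matrix is typed in by hand from the table above, and the names `observed` and `expected` are illustrative rather than part of the original analysis.

```r
# Observed counts copied from the table above (rows: race, columns: satfin)
observed <- matrix(
  c(13455, 19095, 10271,
     1362,  2918,  2960,
      527,  1163,   703),
  nrow = 3, byrow = TRUE,
  dimnames = list(race   = c("White", "Black", "Other"),
                  satfin = c("Satisfied", "More Or Less", "Not At All Sat"))
)

# Expected count under independence: (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
print(round(expected, 1))

# The condition holds if every expected count is at least 5
print(all(expected >= 5))
```

The smallest expected count belongs to the "Other" row, which is the smallest group, so that row is the one worth inspecting first.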

Step III: Methodology

There are 2 categorical variables, both having 3 levels. This calls for a \(\chi^2\) test.

Degrees of Freedom = (r-1)*(c-1)

where r = no. of rows and c = no. of columns
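
Written out, the statistic behind this test compares each observed count with the count expected under independence. These are the standard textbook definitions; the symbols \(O_{ij}\), \(E_{ij}\) and \(n\) (the grand total) are introduced here for illustration and do not appear in the original report.

\[
E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{n},
\qquad
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
df = (r-1)(c-1)
\]

A large \(\chi^2\) value means the observed table departs substantially from what independence would predict.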

Step IV: Inference

df = (3-1)*(3-1)#based on the above table
df

## [1] 4

#applying the chi-squared test
chisq.test(provisional$race, provisional$satfin)

## 
##  Pearson's Chi-squared test
## 
## data:  provisional$race and provisional$satfin
## X-squared = 1091.4, df = 4, p-value < 2.2e-16

The p-value is practically 0, which is far less than the significance level of 5%. Therefore, we can reject H₀ in favour of H_A.

We can also conduct a hypothesis test using the inference() function.

inference(x = race, y = satfin, data = provisional, type = "ht", statistic = "proportion", method = "theoretical", sig_level = 0.05, success = "Satisfied", alternative = "greater" )

## Response variable: categorical (3 levels) 
## Explanatory variable: categorical (3 levels) 
## Observed:
##        y
## x       Satisfied More Or Less Not At All Sat
##   White     13455        19095          10271
##   Black      1362         2918           2960
##   Other       527         1163            703
## 
## Expected:
##        y
## x        Satisfied More Or Less Not At All Sat
##   White 12526.1262    18919.806      11375.068
##   Black  2117.8663     3198.884       1923.250
##   Other   700.0075     1057.311        635.682
## 
## H0: race and satfin are independent
## HA: race and satfin are dependent
## chi_sq = 1091.4207, df = 4, p_value = 0

This result agrees with our previous chisq.test() result.

Step V: Interpretation

Because the reported p-value is effectively 0, far less than the \(\alpha\)-value, we can conclude that race and financial satisfaction are indeed dependent on each other. This conclusion is in line with our preliminary visual interpretation.

Research Objective II: Confidence in Financial Institutions

Brief: Compare the proportion of positive confidence in financial institutions in 2008 to that in 2012.

Variables of Interest:

year: year in which the survey was performed.
confinan: confidence level in financial institutions and banks.

#creating relevant data frame
financial_confidence<-select(gss, year, confinan)

#filtering out NA values
financial_confidence<-filter(financial_confidence, !is.na(confinan), !is.na(year))

#plotting confidence levels by year
ggplot(financial_confidence, aes(year, fill = confinan)) + geom_bar() + labs(x="Year", y="Confidence Levels")

Negative confidence levels (i.e. "Hardly Any") have been on the rise since 1972, as can be seen from the blue-coloured bars in the above graph.

Next, I’ll plot the proportion of positive confidence levels in financial institutions (i.e. "A Great Deal") per year.

#data frame to store proportional value of positive confidence
proportional_financial_confidence<-financial_confidence%>%group_by(year)%>%summarise(confident = sum(confinan == "A Great Deal")/n())

#plotting proportion of positive confidence by year
ggplot(proportional_financial_confidence, aes(year, confident)) + geom_col() +  labs(x="Year", y="Proportional Confidence Levels")

The plot, in general, follows the patterns of the US economy. For example, confidence in financial institutions drops around 1990, which is in line with the US recession of the early 1990s.
At the same time, the confidence levels rise from 1993 to 2000, when the USA experienced an economic boom: an extended period of economic prosperity, during which GDP increased continuously for almost ten years.
Visually, the proportion of positive confidence in 2008 is actually greater than that in 2012. This is inconsistent with our expectations (we expected positive confidence levels to plummet in 2008 due to the economic crisis).
Perhaps there is a confounding variable at play here, but we cannot determine the cause of inconsistency from the given data.

I want to determine whether the difference in proportions is statistically significant, or whether it is simply due to sampling variability. This calls for another statistical inference.

For the purpose of this inference, I will assume "A Great Deal" as success, and "Only Some" & "Hardly Any" as failures.

#investigating years 2008 & 2012
fincon_backup<-filter(financial_confidence, year == 2008 | year == 2012)

#collapsing the "Hardly Any" and "Only Some" levels into a single "failure" level
fincon_backup$confinan<-str_replace(as.character(fincon_backup$confinan), "Hardly Any|Only Some", "failure")

#converting the recoded column back to a factor
fincon_backup$confinan<-as.factor(fincon_backup$confinan)

fincon_backup_summary<-fincon_backup%>%group_by(year)%>%summarise(prop_success = sum(confinan == "A Great Deal")/n())

#plotting proportion of positive confidence in 2008 & 2012
ggplot(fincon_backup_summary, aes(year, prop_success)) + geom_col() + labs(x="Year", y="Proportion of Success")

fincon_backup$year<-as.factor(fincon_backup$year)
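
The recoding of confinan into success and failure can also be done without string replacement, by merging factor levels directly. This is a minimal sketch in base R on a toy vector, not the GSS data; the names `confinan` and `binary` are illustrative, and the level order is an assumption.

```r
# Toy vector using the three confinan levels named in the text
confinan <- factor(c("A Great Deal", "Only Some", "Hardly Any", "Only Some"),
                   levels = c("A Great Deal", "Only Some", "Hardly Any"))

# Assigning the same name to two levels merges them into one level
binary <- confinan
levels(binary)[levels(binary) %in% c("Only Some", "Hardly Any")] <- "failure"

print(levels(binary))
print(table(binary))
```

Working on the levels keeps the column a factor throughout, so no conversion back with as.factor() is needed.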

Step I: State Hypotheses at \(\alpha\) = 0.05

H₀: p₂₀₀₈ = p₂₀₁₂ (there is no difference in the proportion of success in 2008 and 2012)
H_A: p₂₀₀₈ > p₂₀₁₂ (2008 confidence levels are greater than those of 2012)

Step II: Conditions for Central Limit Theorem

The conditions are as follows:

Independence: sample size < 10% of the population

fincon_backup%>%group_by(year)%>%summarise(n())

## # A tibble: 2 x 2
##   year  `n()`
##   <fct> <int>
## 1 2008   1350
## 2 2012   1325

fincon_backup_summary

## # A tibble: 2 x 2
##    year prop_success
##   <int>        <dbl>
## 1  2008        0.195
## 2  2012        0.112

n₂₀₀₈ = 1350
n₂₀₁₂ = 1325

We can safely conclude that these sample sizes are less than 10% of the population of the United States, therefore this condition is met.

Success-failure condition: There should be at least 10 failures and 10 successes in each sample.

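(n₂₀₀₈)(\(\hat{p}\)₂₀₀₈) = 1350*0.195 ≈ 263 successes and (n₂₀₀₈)(1 - \(\hat{p}\)₂₀₀₈) = 1350*0.805 ≈ 1087 failures
(n₂₀₁₂)(\(\hat{p}\)₂₀₁₂) = 1325*0.112 ≈ 148 successes and (n₂₀₁₂)(1 - \(\hat{p}\)₂₀₁₂) = 1325*0.888 ≈ 1177 failures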

As we can see, the success-failure condition is met in both the samples.

Thus, we can use the Central Limit Theorem.
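
For a hypothesis test whose null is p₂₀₀₈ = p₂₀₁₂, the success-failure condition is conventionally checked with the pooled proportion rather than the two separate sample proportions. The sketch below shows that check in base R. The success counts 263 and 149 are not printed in the report; they are assumed here as the integers implied by the sample sizes and the four-decimal proportions that inference() reports in Step IV.

```r
# Sample sizes from the tibble above
n_2008 <- 1350
n_2012 <- 1325

# Assumed success counts, implied by the reported proportions
# (263 / 1350 rounds to 0.1948 and 149 / 1325 rounds to 0.1125)
x_2008 <- 263
x_2012 <- 149

# Pooled proportion under H0: p_2008 = p_2012
p_pool <- (x_2008 + x_2012) / (n_2008 + n_2012)

# Expected successes and failures in each sample under H0
expected_counts <- c(
  success_2008 = n_2008 * p_pool,
  failure_2008 = n_2008 * (1 - p_pool),
  success_2012 = n_2012 * p_pool,
  failure_2012 = n_2012 * (1 - p_pool)
)
print(round(expected_counts, 1))

# The condition holds if every expected count is at least 10
print(all(expected_counts >= 10))
```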

Step III: Methodology

There are 2 categorical variables, both being binary. Therefore, we use the z-statistic to calculate the p-value.

The Confidence Interval for the difference in 2 population proportions (Z = 1.96 for 95% Confidence Level)

\(\hat{p}\)₂₀₀₈ - \(\hat{p}\)₂₀₁₂ +/- ME

where Margin of Error (ME) = Z × (S.E.)
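
For reference, the standard errors behind the interval and the test can be written out as follows. These are the usual textbook formulas for two independent proportions; \(x\) denotes the number of successes in a sample and \(z^{*}\) the critical value (1.96 at the 95% level), symbols introduced here for illustration rather than taken from the original report.

\[
SE_{CI} = \sqrt{\frac{\hat{p}_{2008}(1-\hat{p}_{2008})}{n_{2008}} + \frac{\hat{p}_{2012}(1-\hat{p}_{2012})}{n_{2012}}},
\qquad
ME = z^{*} \cdot SE_{CI}
\]

\[
\hat{p}_{pool} = \frac{x_{2008} + x_{2012}}{n_{2008} + n_{2012}},
\qquad
SE_{H_0} = \sqrt{\hat{p}_{pool}(1-\hat{p}_{pool})\left(\frac{1}{n_{2008}} + \frac{1}{n_{2012}}\right)},
\qquad
z = \frac{\hat{p}_{2008} - \hat{p}_{2012}}{SE_{H_0}}
\]

The confidence interval uses the two sample proportions separately, whereas the hypothesis test pools them because H₀ assumes a single common proportion.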

Step IV: Inference

inference(y = confinan, x = year, data = fincon_backup, statistic = "proportion", type = "ht", null = 0, alternative = "greater", method = "theoretical", success = "A Great Deal")

## Response variable: categorical (2 levels, success: A Great Deal)
## Explanatory variable: categorical (2 levels) 
## n_2008 = 1350, p_hat_2008 = 0.1948
## n_2012 = 1325, p_hat_2012 = 0.1125
## H0: p_2008 =  p_2012
## HA: p_2008 > p_2012
## z = 5.9003
## p_value = < 0.0001

Step V: Interpretation

The p-value is less than \(\alpha\) = 0.05 significance level, which means we can reject the H₀ in favour of the H_A.
The probability of obtaining a difference in sample proportions of 8.23% or greater, if in fact there were no difference in the population proportions, is less than 0.0001.
Thus, there is convincing evidence in favour of H_A and the difference between the confidence levels is indeed statistically significant.
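
The z-statistic reported by inference() can be reproduced by hand as a sanity check. This is a sketch in base R under one assumption: the success counts 263 and 149 are inferred from the reported sample sizes and proportions, since the report does not print them.

```r
# Sample sizes as reported; success counts are assumed (implied by p_hat)
n_2008 <- 1350; x_2008 <- 263
n_2012 <- 1325; x_2012 <- 149

p_hat_2008 <- x_2008 / n_2008
p_hat_2012 <- x_2012 / n_2012

# Pooled proportion and standard error under H0: p_2008 = p_2012
p_pool <- (x_2008 + x_2012) / (n_2008 + n_2012)
se_h0  <- sqrt(p_pool * (1 - p_pool) * (1 / n_2008 + 1 / n_2012))

# One-sided test statistic and p-value for HA: p_2008 > p_2012
z       <- (p_hat_2008 - p_hat_2012) / se_h0
p_value <- pnorm(z, lower.tail = FALSE)

print(round(z, 4))
print(p_value)
```

The statistic agrees with the reported z of 5.9003 to within rounding, which suggests the pooled standard error is what the function used for this test.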

As corroborative proof, let’s calculate the Confidence Interval for the same data:

inference(y = confinan, x = year, data = fincon_backup, statistic = "proportion", type = "ci", method = "theoretical", success = "A Great Deal")

## Response variable: categorical (2 levels, success: A Great Deal)
## Explanatory variable: categorical (2 levels) 
## n_2008 = 1350, p_hat_2008 = 0.1948
## n_2012 = 1325, p_hat_2012 = 0.1125
## 95% CI (2008 - 2012): (0.0552 , 0.1095)

As reported, a 95% Confidence Interval for (p₂₀₀₈ - p₂₀₁₂) is (0.0552 , 0.1095). This agrees with our hypothesis testing, since 0 is not included in the interval.
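
The interval can likewise be reconstructed from the point estimate plus or minus the margin of error. The sketch below uses base R and again assumes success counts of 263 and 149, which are implied by the reported proportions but not printed in the report.

```r
# Sample sizes as reported; success counts are assumed (implied by p_hat)
n_2008 <- 1350; x_2008 <- 263
n_2012 <- 1325; x_2012 <- 149

p_hat_2008 <- x_2008 / n_2008
p_hat_2012 <- x_2012 / n_2012

# Unpooled standard error, as used for a confidence interval
se_ci <- sqrt(p_hat_2008 * (1 - p_hat_2008) / n_2008 +
              p_hat_2012 * (1 - p_hat_2012) / n_2012)

# Critical value for a 95% confidence level (about 1.96)
z_star <- qnorm(0.975)

# Point estimate plus or minus the margin of error
ci <- (p_hat_2008 - p_hat_2012) + c(-1, 1) * z_star * se_ci
print(round(ci, 4))
```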

Research Objective III: Political Party Affiliation

Brief: Is there a difference in the income levels of Americans affiliated with different political parties & views, and is that difference statistically significant?

Variables of Interest:

partyid: Respondent’s political party affiliation.
coninc: Respondent’s inflation-adjusted family income, in dollars.

income_by_party<-gss%>%filter(!is.na(partyid), !is.na(coninc))%>%select(partyid, coninc)

hist(income_by_party$coninc, xlab = "Inflation-adjusted Family Income", main = "Family Income in Constant Dollars")

The income distribution is heavily right-skewed. Ideally, I would use the median as a measure of central tendency, but because I want to investigate only the difference in income levels and not the actual income levels, I will use the sample means so that I can apply ANOVA.

Moving on to Statistical Inference:

Step I: State Hypotheses at \(\alpha\) = 0.05

H₀: The mean income is the same across all categories. Since we have 8 levels in the partyid variable:

\(\mu\)₁ = \(\mu\)₂ = \(\mu\)₃ = \(\mu\)₄ = \(\mu\)₅ = \(\mu\)₆ = \(\mu\)₇ = \(\mu\)₈
H_A: There exists at least one pair of means in which the means are not equal.

Step II: Conditions for ANOVA

Independence

within groups: the random sampling methodology ensures this condition.
between groups: the groups are non-paired because the income levels of one political party affiliation group do not affect those of another.

Thus, we can conclude that the samples are independent.

Approximate normality

Even though the income distribution is skewed, every group has a large sample size (the smallest group has 775 respondents), which offsets this disadvantage.

income_by_party%>%group_by(partyid)%>%summarise(n())

## # A tibble: 8 x 2
##   partyid            `n()`
##   <fct>              <int>
## 1 Strong Democrat     8232
## 2 Not Str Democrat   10995
## 3 Ind,Near Dem        6201
## 4 Independent         7223
## 5 Ind,Near Rep        4525
## 6 Not Str Republican  8166
## 7 Strong Republican   4940
## 8 Other Party          775

Equal Variance

This condition demands that the groups have roughly equal variability.
As we can see from the following side-by-side boxplots, all of the groups have similar variabilities except for the "Strong Republican" group, which has a slightly greater variability. So we can assume this condition to be met.

ggplot(income_by_party, aes(partyid, coninc)) + geom_boxplot() + theme(axis.text.x=element_text(angle=40, hjust=1)) + labs(x="Political Party Affiliation", y="Inflation-adjusted Family Income")
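
The visual impression from the boxplots can be backed with a quick numerical check using the group standard deviations that inference() reports in Step IV. This is a sketch in base R; the vector `s` is typed in by hand from that output. Comparing the ratio against a threshold of about 2 is a commonly quoted rule of thumb and is an assumption here, not something the original analysis states.

```r
# Group standard deviations copied from the inference() output in Step IV
s <- c("Strong Democrat"    = 33195.5316,
       "Not Str Democrat"   = 33045.6652,
       "Ind,Near Dem"       = 34179.2018,
       "Independent"        = 33341.2145,
       "Ind,Near Rep"       = 37136.5051,
       "Not Str Republican" = 38281.6394,
       "Strong Republican"  = 41427.8594,
       "Other Party"        = 38599.9933)

# Ratio of the largest to the smallest group standard deviation
sd_ratio <- max(s) / min(s)

print(names(which.max(s)))  # group with the greatest spread
print(names(which.min(s)))  # group with the smallest spread
print(round(sd_ratio, 3))
```

A ratio close to 1 supports treating the variances as roughly equal; here the largest spread is only about a quarter greater than the smallest.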

Step III: Methodology

We’re comparing means from 8 different, non-paired groups. This calls for the Analysis of Variance (ANOVA) test.

Step IV: Inference

inference(y = coninc, x = partyid, data = income_by_party, statistic = "mean", type = "ht", method = "theoretical", alternative = "greater")

## Response variable: numerical
## Explanatory variable: categorical (8 levels) 
## n_Strong Democrat = 8232, y_bar_Strong Democrat = 37970.1833, s_Strong Democrat = 33195.5316
## n_Not Str Democrat = 10995, y_bar_Not Str Democrat = 41936.0819, s_Not Str Democrat = 33045.6652
## n_Ind,Near Dem = 6201, y_bar_Ind,Near Dem = 42827.115, s_Ind,Near Dem = 34179.2018
## n_Independent = 7223, y_bar_Independent = 38974.2352, s_Independent = 33341.2145
## n_Ind,Near Rep = 4525, y_bar_Ind,Near Rep = 49183.8608, s_Ind,Near Rep = 37136.5051
## n_Not Str Republican = 8166, y_bar_Not Str Republican = 51707.0563, s_Not Str Republican = 38281.6394
## n_Strong Republican = 4940, y_bar_Strong Republican = 55154.2346, s_Strong Republican = 41427.8594
## n_Other Party = 775, y_bar_Other Party = 45588.52, s_Other Party = 38599.9933
## 
## ANOVA:
##              df           Sum_Sq          Mean_Sq        F  p_value
## partyid       7 1746245054827.48 249463579261.069 198.4193 < 0.0001
## Residuals 51049 64181598251474.8  1257254760.1613                  
## Total     51056 65927843306302.3                                   
## 
## Pairwise tests - t tests with pooled SD:
##                group1             group2    p.value
## 1    Not Str Democrat    Strong Democrat  1.696e-14
## 2        Ind,Near Dem    Strong Democrat  3.839e-16
## 3         Independent    Strong Democrat  7.903e-02
## 4        Ind,Near Rep    Strong Democrat  2.711e-65
## 5  Not Str Republican    Strong Democrat 5.019e-135
## 6   Strong Republican    Strong Democrat 1.333e-158
## 7         Other Party    Strong Democrat  1.082e-08
## 8        Ind,Near Dem   Not Str Democrat  1.136e-01
## 9         Independent   Not Str Democrat  3.502e-08
## 10       Ind,Near Rep   Not Str Democrat  6.157e-31
## 11 Not Str Republican   Not Str Democrat  4.244e-79
## 12  Strong Republican   Not Str Democrat 1.515e-104
## 13        Other Party   Not Str Democrat  5.580e-03
## 14        Independent       Ind,Near Dem  3.489e-10
## 15       Ind,Near Rep       Ind,Near Dem  4.925e-20
## 16 Not Str Republican       Ind,Near Dem  6.771e-50
## 17  Strong Republican       Ind,Near Dem  5.146e-74
## 18        Other Party       Ind,Near Dem  4.095e-02
## 19       Ind,Near Rep        Independent  5.567e-52
## 20 Not Str Republican        Independent 5.603e-109
## 21  Strong Republican        Independent 4.497e-134
## 22        Other Party        Independent  8.039e-07
## 23 Not Str Republican       Ind,Near Rep  1.233e-04
## 24  Strong Republican       Ind,Near Rep  2.836e-16
## 25        Other Party       Ind,Near Rep  9.103e-03
## 26  Strong Republican Not Str Republican  6.934e-08
## 27        Other Party Not Str Republican  4.424e-06
## 28        Other Party  Strong Republican  2.935e-12

The reported F-value is very large (198.4193), which yields a p-value of less than 0.0001.
The total degrees of freedom value is 51,056.
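
The F-value follows directly from the sums of squares and degrees of freedom in the ANOVA table. The sketch below recomputes it in base R from the printed figures; the variable names are illustrative and not part of the original analysis.

```r
# Sums of squares and degrees of freedom copied from the ANOVA table above
ss_group <- 1746245054827.48
df_group <- 7
ss_error <- 64181598251474.8
df_error <- 51049

# Mean squares: sum of squares divided by its degrees of freedom
ms_group <- ss_group / df_group
ms_error <- ss_error / df_error

# F statistic and its upper-tail p-value
f_stat  <- ms_group / ms_error
p_value <- pf(f_stat, df_group, df_error, lower.tail = FALSE)

print(round(f_stat, 4))
print(p_value)
print(df_group + df_error)  # total degrees of freedom
```

The group degrees of freedom equal the number of groups minus one (8 - 1 = 7), and the residual degrees of freedom equal the total sample size minus the number of groups.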

Step V: Interpretation

Because of the small p-value (<\(\alpha\) = 0.05), we reject the null hypothesis in favour of the H_A.
Thus, there is convincing evidence that at least one pair of group means differs. However, the ANOVA F-test on its own does not identify which pairs differ; the pairwise t-tests printed beneath the ANOVA table address that question.
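
One common way to read the pairwise t-tests is to compare each p-value against a Bonferroni-corrected significance level, \(\alpha\) divided by the number of comparisons. The original analysis does not apply this correction, so the sketch below is an illustrative follow-up under that assumption. It uses base R, with the p-values typed in by hand from the pairwise output above; the names `party`, `pairwise` and `alpha_star` are hypothetical.

```r
# Party labels in the order used by the inference() output
party <- c("Strong Democrat", "Not Str Democrat", "Ind,Near Dem", "Independent",
           "Ind,Near Rep", "Not Str Republican", "Strong Republican", "Other Party")

# Pairwise p-values copied from the output above, in the same row order
p_value <- c(1.696e-14, 3.839e-16, 7.903e-02, 2.711e-65, 5.019e-135, 1.333e-158,
             1.082e-08, 1.136e-01, 3.502e-08, 6.157e-31, 4.244e-79, 1.515e-104,
             5.580e-03, 3.489e-10, 4.925e-20, 6.771e-50, 5.146e-74, 4.095e-02,
             5.567e-52, 5.603e-109, 4.497e-134, 8.039e-07, 1.233e-04, 2.836e-16,
             9.103e-03, 6.934e-08, 4.424e-06, 2.935e-12)

# All unordered pairs; combn() yields them in the same order as the output
pairs <- t(combn(party, 2))   # column 1 = group2, column 2 = group1
k <- nrow(pairs)              # number of comparisons

# Bonferroni-corrected significance level
alpha_star <- 0.05 / k

pairwise <- data.frame(group1 = pairs[, 2], group2 = pairs[, 1], p_value = p_value)

# Pairs whose difference is NOT significant at the corrected level
not_significant <- pairwise[pairwise$p_value >= alpha_star, ]
print(not_significant)
```

With 8 groups there are 28 comparisons, so the corrected level is roughly 0.0018. Pairs with p-values above it cannot be declared different at the overall 5% level.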

Part 4: Inference

Inferences and interpretations have already been discussed in detail in Part 3 under each Research Objective. The purpose of this part is to recap the interpretations and conclude the analysis of the GSS Data.

Interpretation for Objective I

A \(\chi^2\) test was conducted to test the stated hypotheses.
Because the reported p-value was found to be nearly 0 (far less than the \(\alpha\)-value), it was concluded that race and financial satisfaction are indeed dependent on each other.

Interpretation for Objective II

We tested for difference in proportions of success in 2008 and 2012.
Based on the p-value, we rejected the H₀ in favour of the H_A. Although the difference in proportions was found to be statistically significant, it was not in the expected direction: the 2008 confidence level was greater than the 2012 confidence level.

Interpretation for Objective III

We tested for a difference in at least one pair of means of the income levels of groups of people with various political party affiliations. Because of the small p-value (<\(\alpha\) = 0.05), we rejected the null hypothesis in favour of the H_A.
However, the ANOVA test does not provide information on which pairs of means are different.

Saurabh Bodas
